Abstract — We propose two graphical models to represent a 
concise description of the causal statistical dependence structure 
between a group of coupled stochastic processes. The first, 
minimum generative model graphs, is motivated by generative 
models. The second, directed information graphs, is motivated 
by Granger causality. We show that under mild assumptions, 
the graphs are identical. In fact, these are analogous to Bayesian 
and Markov networks respectively, in terms of Markov blankets 
and I-map properties. Furthermore, the underlying variable 
dependence structure is the unique causal Bayesian network. 
Lastly, we present a method using minimal-dimension statistics 
to identify the structure when upper bounds on the in-degrees 
are known. Simulations show the effectiveness of the approach. 

Index Terms — Graphical models, network inference, Granger 
causality, generative models. 



I. Introduction 

RESEARCH in many disciplines, including biology, eco- 
nomics, social sciences, computer science, and physics, 
involves studying networks of interacting, stochastic processes. 
The human brain, stock market, and Internet are some ex- 
amples. For numerous research problems, it is important to 
characterize the structure of these large networks of interacting 
processes that elucidates the extent to which the past of some 
processes affects the future of others. In particular, it can be 
useful to have a succinct representation of the structure, such 
as a simple graphical model of the network. For instance, each 
node could represent a process, with directed edges between 
the nodes representing directions of influence. 

There is a large body of research on developing well-defined 
graphical representations of networks of random variables. 
Markov networks, Bayesian networks, dynamic Bayesian net- 
works, and chain graph models are some well-known exam- 
ples. In these, random variables are represented as nodes in the 
graph. Edges represent conditional dependence relationships 
between the variables. These graphical models can be used 
for arbitrary sets of random variables, but the relationships 
they show are mutual. These graphs have been successfully 
used to represent the structure of networks of static stochastic 
systems. Some applications include object recognition [1], 
error-correcting codes [2], cellular networks [3], and medical 
diagnostics [4]. 

Markov networks and Bayesian networks in particular rep- 
resent two different perspectives on the structure of networks 
of random variables. Markov networks directly represent the 
dependence between each pair of variables, conditioned on 
all other variables. Bayesian networks represent factorizations 
of the joint distribution, so each variable potentially depends 
on preceding variables, and then the conditional terms are 
reduced. See [5] for an overview of graphical models. 

Networks of random variables do not lend themselves 
to concise representation of interacting stochastic processes, 
which are sets of random variables indexed over time. Rep- 
resenting each random variable as a node results in large, 
cumbersome graphs growing with time. Moreover, such a 
representation will not aid with visualization of the structure 
of inter-dynamics of coupled time series. For instance, it could 
be difficult to see how the past of some processes affects the 
future of others. 

Examples of such networks of dynamic stochastic systems 
include financial networks, computer networks, social net- 
works, and biological networks. In financial markets, invest- 
ment strategies are based on predictions of how past activity of 
some stocks affects the future activity of others. For instance, 
suppose that the price of corn, a component of chicken feed, 
strongly influences the price of chicken. Simulated data are 
shown in Fig. 1. Knowing this causal relationship can lead 
to better investment strategies than simply knowing the two 
are correlated. 

Another example is with computer networks. To investi- 
gate a computer attack, for instance, simply knowing the 
connectivity structure alone is not sufficient to identify the 
source. For legal proceedings involving warrants and seizures, 
it is significantly more useful to know the traffic influence 
structure (see Fig. 2). Likewise, many social networks have 
non-mutual relationship structures. Twitter, email networks 
within a company, and gossip networks are some examples 
where the connectivity alone does not convey the structure 
(see Fig. 3). 

One last example is from biological networks. The primate 
visual system functions through a complex feedback network 
[6]. As light enters the eye, information is sent from the 
eye to other parts of the brain, where initially low-level 
processing occurs, such as processing simple shapes. Then 
information is sent to higher-level processing, such as object 
and facial recognition. Simply considering correlation between 
these parts of the brain might suggest information is sent in 
a feed-forward manner, from the eye to low-level to high- 
level processing areas. However, it has been shown that visual 
processing relies extensively on feedback between these areas 
(see Fig. 4). 

Fig. 1. A plot of simulated corn and chicken prices. The price of chicken 
depends on the cost of corn, but not vice versa. Understanding this causal 
relationship could lead to better investment strategies. 

Fig. 2. The traffic influence structure of a computer network during an attack. 
Unlike the connectivity structure alone, the causal influence structure clearly 
reveals the attacker. 

As illustrated in these examples, the structure of networks 
of dynamic stochastic systems is not fully captured by con- 
sidering mutual relationships. Thus, while criteria such as 
mutual information are used to identify relationships in static 
stochastic systems, another criterion is needed to define the 
minimal description of the relationship between the past of 
some processes and the future of others. 

Clive Granger, a Nobel laureate, proposed a framework 
for this purpose. He developed a methodology for deciding 
when, in a statistical sense, one process X causally influences 
another process Y in a network. His methodology is based 
on quantifying how much causal side information of X, its 
past, helps in sequentially predicting the future of Y. His 
framework also accounts for other processes influencing Y, 
thus characterizing which influences are direct. Researchers 
have applied his framework in a number of fields, including 
biology, economics, and social sciences [7], [8], [10], 
[11], [12]. For most applications, however, the analysis was 
restricted to the context of linear multivariate models. As we 
will show, in a general formulation of the sequential prediction 
task with causal side information, an information theoretic 
quantity known as causally conditioned directed information 
[13] is precisely the value of having access to the causal side 
information. 

Fig. 3. To understand how trends, news, and gossip transfer dynamically in 
a social network, the relationship structure needs to be understood, not just 
the connectivity structure. 

Fig. 4. Visual processing in the brain occurs with information propagating 
between various parts of the brain. In addition to feed-forward pathways from 
the eye to higher-level processing regions of the brain, feedback has been 
shown to play a crucial role. The feedback might not be evident from only 
considering purely correlative relationships.^1 

We will thus propose a graphical model based on causally 
conditioned directed information. Each node will represent a 
random process. An edge will be drawn from a node X to a 
node Y if X influences Y in the sense of Granger, quanti- 
fied by causally conditioned directed information. The causal 
conditioning is on all other processes to account for indirect 
influences. The resulting graphs, called directed information 
graphs, are analogous to Markov networks in terms of Markov 
blankets, graph separation, and I-map properties. 

We further justify that directed information graphs represent 
the causal influence structure by showing that they are equiv- 
alent to graphs based on generative models. For the latter, the 
joint distribution is factorized over time, analogous to a state- 
space model of coupled differential equations. The terms in 
the factorization are then reduced. These minimal generative 
models are analogous to Bayesian networks. Interestingly, 
while Markov and Bayesian networks are distinct, offering 
different perspectives on mutual dependence structure between 
variables, directed information graphs and minimum genera- 
tive model graphs offer the same fundamental perspective on 
the causal influence structure between processes. 

Since the directed information graphs and minimum generative 
model graphs are equivalent, either can be used to identify 
the structure of the system. Despite this flexibility, finding 
the structure by either definition requires computing quantities 
using the full joint distribution. A natural question is whether 
instead of the full joint distribution, lower-dimensional statistics 
could be used when there is some knowledge about the 
structure. For instance, some gene networks, ecosystems, and 
computer virus infection networks have random graph structures, 
and some metabolic, neuronal, and social networks have 
small-world structures [14]. Optimal transportation systems 
and the blood vessel system are known to have tree structures. 

^1 Figure adapted from http://commons.wikimedia.org/wiki/File:Human_Brain_sketch_with_eyes_and_cerebrellum.svg 

Trees are among the simplest of structures, and can be 
found using only pairwise statistics between processes. Chow 
and Liu proposed a method for finding the best tree struc- 
tured Bayesian network approximation for a set of random 
variables [17]. Recently, analogous approaches for recovering 
tree structures in the context of networks of random processes 
were suggested in [18] and [19]. These approaches use pairwise 
statistics and a coupled, global optimization step. We 
will propose a method that recovers the structure using the 
minimal-dimension statistics necessary when the in-degrees of 
the processes are upper bounded. Also, the method will not 
require any coupled optimization step. 



A. Our Contribution 

We propose two graphical models for identifying the struc- 
ture of networks of causally interacting, stochastic, dynamic 
processes. The first, minimum generative model graphs, is 
based on reduced factorizations of the joint distribution of 
the network over time. It is motivated by simplifying sets of 
coupled differential equations. The other, directed information 
graphs, is motivated by Granger causality. We show that 
directed information quantifies Granger causality in a general 
prediction framework. The causal influence between each pair 
of processes is directly queried using directed information. 
Moreover, we show that these two graphical models are analogous 
to Bayesian and Markov networks, respectively. Unlike 
Markov and Bayesian networks, however, the two proposed 
graphical models are equivalent, suggesting they encode a 
fundamental causal influence structure of the network. 

We discuss how the structure of directed information graphs 
is related to causal Markov chains. We show that the variable 
dependence structure underneath a directed information graph 
is a dynamic Bayesian network. Additionally, we demon- 
strate that directed information graphs, similar to Markov and 
Bayesian networks, form a type of independence map (I-map) 
with a graphical separation criterion similar to d-separation for 
Bayesian networks. 

Finally, we propose an efficient method to identify the struc- 
ture of directed information graphs when the in-degrees are 
bounded. The method uses the minimal-dimension statistics 
necessary to find the structure. For a process Y with at most 
K parents, the method uses only (K + 1)-wise statistics. 
Furthermore, the method finds the parents of Y independently 
of finding the parents of other processes. We also demonstrate 
this method by inferring the minimal generative model struc- 
ture of a network of causally interacting processes. 



B. Related Work 

1) Directed information: Directed information was first 
introduced by Marko [20] and independently rediscovered by 
Rissanen and Wax [21]. Rissanen and Wax proposed their 
work as an extension of Granger's framework [22], which in 
turn was based on Wiener's work [23]. Directed information 
was later formalized by Massey [24]. It plays a fundamental 
role in communication with feedback [20], [13], [25], [26], 
EH, [27], gambling with causal side information [28], [29], 
control over noisy channels [30], [31], [32], [25], and source 
coding with feed forward [33], [29]. 

Directed information has already been used in some applica- 
tions to infer causal relationships between nonlinear processes. 
In Marko's original paper [20], directed information is used to 
analyze social relationships between primates. Other applications 
of directed information include analysis of neuroscience 
data [34], [35], [36], [37], gene regulatory data [38], [39], and 
video recordings [40]. Parametric methods have been proposed 
for estimating the directed information in the context of point 
processes [34], [35], [37]. Additionally, a universal estimation 
technique applicable to discrete-time, finite-alphabet processes 
was proposed in [41]. 

2) Graphical models and Granger causality: There is a 
large body of literature on graphical models for representing 
the mutual dependence structure of sets of random variables. 
We will only reference Pearl [42] and Koller and Friedman [5], 
which provide good foundations for the theory and techniques. 
These models have been applied to networks of causally 
interacting processes. One approach is dynamic Bayesian 
networks, where each variable for each process is represented 
as a node in the graph [43]. However, these graphs unravel to 
larger and larger size with time, depend upon a spatial ordering 
of the processes using the chain rule, and do not succinctly 
elucidate relationships between the past of some processes and 
the future of others. 

In this paper, we use the framework of Granger causality. 
An alternative framework for identifying causal interactions 
between variables is based on the principle of intervention 
[44]. By fixing certain variables to have specific values and 
observing how the statistics of other variables change, some 
causal relationships can be inferred. Ay and Polani [45] 
propose a measure for how strong such causal effects are in 
the context of processes. Some connections between Pearl's 
framework and Granger causality (as measured by directed 
information) are explored in [46]. 

There is a class of graphical models developed to represent 
Granger's principle, known as Granger causality graphs 
[47], [48], [49]. These are mixed graphs (with both directed and 
undirected edges) for multivariate autoregressive time series. 
Nodes represent processes. The directed edges represent causal 
influences, as measured by Granger causality. The undirected 
edges represent instantaneous correlation. 

In [47], it is suggested that, conceptually, Granger causality 
graphs could be employed for nonlinear relationships. How- 
ever, it is mentioned that some properties of the graphical 
model would not hold. They also suggested it would be 
impossible to infer structures where causal influences were 
nonlinear without assuming specific models. 
There was a recent work on Granger causality graphs 
that proposed using directed information [50]. However, it 
justifies using directed information conceptually, motivated 
by the equivalence of Granger causality and directed information 
in the case of jointly Gaussian processes [51], and does not 
identify properties of the graph. Independently, [52] showed 
the relationship between Granger causality and transfer entropy. 
Directed information is the time average of transfer 
entropy. Transfer entropy was proposed by Schreiber [53], 
independently of directed information. 

Other works proposing the use of directed information to 
quantify Granger causality for networks of general processes 
include [34], [35], [36], [40], independent of each other and 
of [50], [51]. The work [34] also proposed a graphical model, 
though different from what is proposed here. 

3) Graph structure identification: When an upper bound on 
the in-degree is known for the structure of a set of random 
variables, there are algorithms for Markov and Bayesian 
networks that use local statistics (from pairwise up to the size 
of the upper bound). They recover the Markov blankets and 
consequently the structure. An early work using this approach 
is the SGS algorithm [54]. More recent works are discussed 
in [55] and [5]. An alternative approach to identifying sparse 
structures of networks of causal processes uses group Lasso, 
based on the model selection technique Lasso [56]. 

C. Paper Organization 

In Section II we establish definitions and notations. We also 
review basic properties of Bayesian and Markov networks 
that will be relevant for our proposed graphical models. In 
Section III we formally introduce the two proposed graphical 
models. We discuss how directed information precisely 
quantifies Granger causality in a general, sequential prediction 
setting. We then show the two proposed graphical models 
are equivalent. In Section IV we discuss how the variable 
structure underlying the proposed graphical models is the 
unique, causal Bayesian network. We also formulate how 
causally conditioned independences are represented in directed 
information graphs by describing their I-map properties. In 
Section V we consider algorithms for identifying the graphical 
models. We describe the procedure that identifies the directed 
information graph with minimal-dimension statistics when 
upper bounds on the in-degrees are known. In Section VI we 
use this efficient method to infer the structure of a simulated 
network of causally interacting processes. In the appendices, 
we provide the proofs of the lemmas and theorems in the 
paper. 

II. Background 
A. Notation and Information Theoretic Definitions 

• For a sequence $a_1, a_2, \ldots$, denote $(a_i, \ldots, a_j)$ as $a_i^j$ and 
$a^k \triangleq a_1^k$. 

• Denote $[m] = \{1, \ldots, m\}$ and the power set $2^{[m]}$ on $[m]$ 
to be the set of all subsets of $[m]$. 

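• For any Borel space $\mathsf{Z}$, denote its Borel sets by $\mathcal{B}(\mathsf{Z})$ and 
the space of probability measures on $(\mathsf{Z}, \mathcal{B}(\mathsf{Z}))$ as $\mathcal{P}(\mathsf{Z})$. 

• Consider two probability measures $\mathbb{P}$ and $\mathbb{Q}$ in $\mathcal{P}(\mathsf{Z})$. 
$\mathbb{P}$ is absolutely continuous with respect to $\mathbb{Q}$ ($\mathbb{P} \ll \mathbb{Q}$) 
if $\mathbb{Q}(A) = 0$ implies that $\mathbb{P}(A) = 0$ for all $A \in \mathcal{B}(\mathsf{Z})$. 
If $\mathbb{P} \ll \mathbb{Q}$, denote the Radon-Nikodym derivative as the 
random variable $\frac{d\mathbb{P}}{d\mathbb{Q}} : \mathsf{Z} \to \mathbb{R}$ that satisfies 

$$\mathbb{P}(A) = \int_{z \in A} \frac{d\mathbb{P}}{d\mathbb{Q}}(z)\, \mathbb{Q}(dz), \qquad A \in \mathcal{B}(\mathsf{Z}).$$

• The Kullback-Leibler divergence between $\mathbb{P} \in \mathcal{P}(\mathsf{Z})$ and 
$\mathbb{Q} \in \mathcal{P}(\mathsf{Z})$ is defined as 

$$D(\mathbb{P} \| \mathbb{Q}) \triangleq \int_{z \in \mathsf{Z}} \log \frac{d\mathbb{P}}{d\mathbb{Q}}(z)\, \mathbb{P}(dz)$$

if $\mathbb{P} \ll \mathbb{Q}$ and $\infty$ otherwise. 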

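• Throughout this paper, we will consider $m$ random 
processes where the $i$th (with $i \in \{1, \ldots, m\}$) random 
process at time $j$ (with $j \in \{1, \ldots, n\}$) takes values in 
a Borel space $\mathsf{X}$. 

• For a sample space $\Omega$, sigma-algebra $\mathcal{F}$, and probability 
measure $\mathbb{P}$, denote the probability space as $(\Omega, \mathcal{F}, \mathbb{P})$. 

• Denote the $i$th random variable at time $j$ by 
$X_{i,j} : \Omega \to \mathsf{X}$, the $i$th random process as $\mathbf{X}_i = (X_{i,1}, \ldots, X_{i,n})$, 
the subset of random processes $\mathbf{X}_{\mathcal{I}} = (\mathbf{X}_i : i \in \mathcal{I})$, 
and the whole collection of all $m$ random processes as 
$\underline{\mathbf{X}} = \mathbf{X}_{[m]}$. Denote the whole collection of all $m$ random 
processes from time $j$ to $l$ as $\underline{\mathbf{X}}_j^l$. 

• The probability measure $\mathbb{P}$ thus induces a probability 
distribution on $X_{i,j}$ given by $P_{X_{i,j}}(\cdot) \in \mathcal{P}(\mathsf{X})$, a joint 
distribution on $\mathbf{X}_i$ given by $P_{\mathbf{X}_i}(\cdot) \in \mathcal{P}(\mathsf{X}^n)$, and a joint 
distribution on $\mathbf{X}_{\mathcal{I}}$ given by $P_{\mathbf{X}_{\mathcal{I}}}(\cdot) \in \mathcal{P}(\mathsf{X}^{|\mathcal{I}| n})$. 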
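• With slight abuse of notation, denote $\mathbf{Y} = \mathbf{X}_i$ for some $i$ 
and $\mathbf{X} = \mathbf{X}_k$ for some $k \neq i$, and denote the conditional 
distribution and causally conditioned distribution of $\mathbf{Y}$ 
given $\mathbf{X}$ respectively as 

$$P_{\mathbf{Y}|\mathbf{X}=\mathbf{x}}(d\mathbf{y}) = P_{\mathbf{Y}|\mathbf{X}}(d\mathbf{y}|\mathbf{x}) \triangleq \prod_{j=1}^{n} P_{Y_j|Y^{j-1},X^{n}}(dy_j | y^{j-1}, x^{n}) \tag{1}$$

$$P_{\mathbf{Y}\|\mathbf{X}=\mathbf{x}}(d\mathbf{y}) = P_{\mathbf{Y}\|\mathbf{X}}(d\mathbf{y}\|\mathbf{x}) \triangleq \prod_{j=1}^{n} P_{Y_j|Y^{j-1},X^{j-1}}(dy_j | y^{j-1}, x^{j-1}). \tag{2}$$

Note the similarity between regular conditioning (1) and 
causal conditioning (2), except that in causal conditioning the 
future ($x_j^n$) is not conditioned on (Ch. 3 of [13]).^2 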
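• With slight abuse of notation, let $\mathbf{W} = \mathbf{X}_{\mathcal{I}}$ for some $\mathcal{I} \subseteq 
[m] \setminus \{i\}$ with $\mathsf{W} = \mathsf{X}^{|\mathcal{I}| n}$. Consider two sets of causally 
conditioned distributions $\{P_{\mathbf{Y}\|\mathbf{W}=\mathbf{w}} \in \mathcal{P}(\mathsf{X}^n) : \mathbf{w} \in \mathsf{W}\}$ 
and $\{Q_{\mathbf{Y}\|\mathbf{W}=\mathbf{w}} \in \mathcal{P}(\mathsf{X}^n) : \mathbf{w} \in \mathsf{W}\}$ along with a 
marginal distribution $P_{\mathbf{W}} \in \mathcal{P}(\mathsf{W})$. Then the conditional 
KL divergence is given by 

$$D(P_{\mathbf{Y}\|\mathbf{W}} \,\|\, Q_{\mathbf{Y}\|\mathbf{W}} \,|\, P_{\mathbf{W}}) = \int_{\mathsf{W}} D(P_{\mathbf{Y}\|\mathbf{W}=\mathbf{w}} \,\|\, Q_{\mathbf{Y}\|\mathbf{W}=\mathbf{w}})\, P_{\mathbf{W}}(d\mathbf{w}).$$

The following Lemma will be useful throughout: 
Lemma 2.1: $D(P_{\mathbf{Y}\|\mathbf{W}} \,\|\, Q_{\mathbf{Y}\|\mathbf{W}} \,|\, P_{\mathbf{W}}) = 0$ if and only if 
$P_{\mathbf{Y}\|\mathbf{W}=\mathbf{w}}(d\mathbf{y}) = Q_{\mathbf{Y}\|\mathbf{W}=\mathbf{w}}(d\mathbf{y})$ with $P_{\mathbf{W}}$ probability 
one. 

^2 Note the slight difference in conditioning upon $x^{j-1}$ in this definition as 
compared to $x^{j}$ in the original causal conditioning definition. The purpose 
for doing this will become clear later in the manuscript. 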

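• Let $\mathbf{X} = \mathbf{X}_i$ for some $i$, $\mathbf{Y} = \mathbf{X}_k$ for some $k$, and 
$\mathbf{W} = \mathbf{X}_{\mathcal{I}}$ for some $\mathcal{I} \subseteq [m] \setminus \{i, k\}$. The mutual information, 
directed information [24], and causally conditioned 
directed information [13] are given by 

$$I(\mathbf{X}; \mathbf{Y}) \triangleq D(P_{\mathbf{Y}|\mathbf{X}} \,\|\, P_{\mathbf{Y}} \,|\, P_{\mathbf{X}})$$
$$I(\mathbf{X} \to \mathbf{Y}) \triangleq D(P_{\mathbf{Y}\|\mathbf{X}} \,\|\, P_{\mathbf{Y}} \,|\, P_{\mathbf{X}}) \tag{3}$$
$$I(\mathbf{X} \to \mathbf{Y} \| \mathbf{W}) \triangleq D(P_{\mathbf{Y}\|\mathbf{X},\mathbf{W}} \,\|\, P_{\mathbf{Y}\|\mathbf{W}} \,|\, P_{\mathbf{X},\mathbf{W}}). \tag{4}$$

Conceptually, mutual information and directed information 
are related. However, while mutual information quantifies 
statistical correlation (in the colloquial sense of statistical 
interdependence), directed information quantifies 
statistical causation. We later justify this statement by showing 
that directed information is a general formulation of 
Granger causality. Note that $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{Y}; \mathbf{X})$, but 
$I(\mathbf{X} \to \mathbf{Y}) \neq I(\mathbf{Y} \to \mathbf{X})$ in general. As a consequence 
of Lemma 2.1 and (4), we have: 

Corollary 2.2: $I(\mathbf{X} \to \mathbf{Y} \| \mathbf{W}) = 0$ if and only if $\mathbf{Y}$ is 
causally independent of $\mathbf{X}$ causally conditioned on $\mathbf{W}$: 

$$P_{\mathbf{Y}\|\mathbf{X}=\mathbf{x},\mathbf{W}=\mathbf{w}}(d\mathbf{y}) = P_{\mathbf{Y}\|\mathbf{W}=\mathbf{w}}(d\mathbf{y}), \quad P_{\mathbf{X},\mathbf{W}}\text{-a.s.}$$

Equivalently, we denote that $\mathbf{X} \to \mathbf{W} \to \mathbf{Y}$ forms 
a causal Markov chain. This is analogous to Markov 
chains, denoted as $\mathbf{X} - \mathbf{W} - \mathbf{Y}$, where $I(\mathbf{X}; \mathbf{Y} | \mathbf{W}) = 0$ 
if and only if $\mathbf{Y}$ is independent of $\mathbf{X}$ conditioned on $\mathbf{W}$: 

$$P_{\mathbf{Y}|\mathbf{X}=\mathbf{x},\mathbf{W}=\mathbf{w}}(d\mathbf{y}) = P_{\mathbf{Y}|\mathbf{W}=\mathbf{w}}(d\mathbf{y}), \quad P_{\mathbf{X},\mathbf{W}}\text{-a.s.}$$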

• Let $G = (V, E)$ denote a directed graph. For each edge 
$(u, v) \in E$, $u$ is called the parent and $v$ is the child. 
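
The asymmetry between directed information in the two directions can be checked numerically on a small discrete example. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical toy pair of binary processes (X i.i.d. fair bits, and Y_j a noisy copy of X_{j-1}), and it assumes the standard expansion of (3) under the strictly causal conditioning of (2), namely I(X → Y) = Σ_j I(X^{j-1}; Y_j | Y^{j-1}). The names `joint_pmf`, `cond_mutual_info`, and `directed_info` are hypothetical helpers.

```python
from itertools import product
from math import log2

N = 3        # number of time steps (illustrative choice)
FLIP = 0.1   # P(Y_j != X_{j-1}) for j >= 2 (illustrative choice)


def joint_pmf():
    """Toy joint pmf of (x^N, y^N).

    X_1..X_N are i.i.d. fair bits, Y_1 is an independent fair bit, and
    Y_j equals X_{j-1} flipped with probability FLIP for j >= 2.
    """
    pmf = {}
    for x in product((0, 1), repeat=N):
        for y in product((0, 1), repeat=N):
            p = 0.5 ** N * 0.5  # probability of x^N times probability of Y_1
            for j in range(1, N):
                p *= FLIP if y[j] != x[j - 1] else 1.0 - FLIP
            pmf[(x, y)] = p
    return pmf


def marginal(pmf, key):
    """Marginal pmf of the quantity key(x, y)."""
    out = {}
    for (x, y), p in pmf.items():
        k = key(x, y)
        out[k] = out.get(k, 0.0) + p
    return out


def cond_mutual_info(pmf, a, b, c):
    """I(A; B | C) in bits, where a, b, c map (x, y) to hashable values."""
    p_abc = marginal(pmf, lambda x, y: (a(x, y), b(x, y), c(x, y)))
    p_ac = marginal(pmf, lambda x, y: (a(x, y), c(x, y)))
    p_bc = marginal(pmf, lambda x, y: (b(x, y), c(x, y)))
    p_c = marginal(pmf, c)
    return sum(p * log2(p * p_c[cv] / (p_ac[(av, cv)] * p_bc[(bv, cv)]))
               for (av, bv, cv), p in p_abc.items() if p > 0)


def directed_info(pmf, reverse=False):
    """Sum over j of I(source^{j-1}; target_j | target^{j-1}).

    reverse=False gives I(X -> Y); reverse=True gives I(Y -> X).
    """
    total = 0.0
    for j in range(N):
        def src_past(x, y, j=j):
            return (y if reverse else x)[:j]

        def dst_now(x, y, j=j):
            return (x if reverse else y)[j]

        def dst_past(x, y, j=j):
            return (x if reverse else y)[:j]

        total += cond_mutual_info(pmf, src_past, dst_now, dst_past)
    return total


pmf = joint_pmf()
print("I(X -> Y) =", directed_info(pmf))
print("I(Y -> X) =", directed_info(pmf, reverse=True))
```

In this toy setting the past of X helps predict Y, so I(X → Y) is strictly positive (about 1.06 bits for these parameters), while I(Y → X) is zero up to floating-point error because X never depends on the past of Y.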

We next review basic properties of Markov and Bayesian 
networks, as the proposed graphical models are analogous to 
these. The remaining discussion in this section follows from 
Chapter 3 of [42]. 

B. Bayesian Networks 

To motivate Bayesian networks, consider the following 
example. 

Example 1: Let $\{W, X, Y, Z\}$ be a set of four random 
variables with a positive joint distribution $P_{W,X,Y,Z}$. Let their 
relationships be of the form: 

$$Y = W + X + \epsilon$$
$$Z = W + \epsilon'$$

where $\epsilon$ and $\epsilon'$ are noises and $W$, $X$, $\epsilon$, and $\epsilon'$ are all independent. 
Consequently, the joint distribution can be factorized as: 

$$P_{W,X,Y,Z}(w, x, y, z) = P_W(w)\, P_{X|W}(x|w)\, P_{Y|W,X}(y|w,x)\, P_{Z|W,X,Y}(z|w,x,y)$$
$$= P_W(w)\, P_X(x)\, P_{Y|W,X}(y|w,x)\, P_{Z|W}(z|w).$$

Since the distribution cannot be reduced further, these dependencies 
represent the structure. This structure can be depicted 
graphically with directed edges corresponding to dependencies, 
as is shown in Figure 5(a). 

Fig. 5. Bayesian and Markov networks for Example 1. (a) A Bayesian network. (b) The Markov network. 

Figure 5(a) is an example of a Bayesian network. 
Bayesian networks are directed graphs representing 
conditional dependencies in a reduced factorization of 
the joint distribution. Note that the Bayesian network 
in Example 1 depended on how the chain rule was 
applied to the joint distribution. Other orderings 
correspond to different graphs. For instance, the 
chain rule could be applied as 

$$P_{W,X,Y,Z}(w, x, y, z) = P_Y(y)\, P_{Z|Y}(z|y)\, P_{X|Y,Z}(x|y,z)\, P_{W|X,Y,Z}(w|x,y,z).$$

Here, however, no term can be reduced. 

Also note that there are no directed cycles. This is because 
a variable can only have incoming arrows from variables with 
smaller index and outgoing arrows to variables with a larger 
index. Lastly, although the graph has directed edges, the edges 
correspond to conditional dependence relationships, which are 
mutual. We now discuss an alternative representation. 

C. Markov Networks 

Bayesian networks are one method to graphically represent 
the conditional dependence structure of a set of random 
variables. They are based on factorizations of the joint distribution, 
where unnecessary dependencies are removed. An alternative 
method is known as Markov networks, which are undirected 
graphs. An edge is drawn between each pair of variables that 
are dependent, given knowledge of all other variables. Conditional 
mutual information is used to quantify dependence. 

Consider the system in Example 1 (see Figure 5). For 
the Markov network, each edge is determined by testing 
"globally" conditioned dependencies. For instance, 

$$Y \perp Z \mid W, X$$
$$X \perp Z \mid W, Y$$
$$W \not\perp X \mid Y, Z.$$

The remaining three pairs are conditionally dependent. Note 
that the last formula above, for instance, corresponds to 
$I(W; X | Y, Z) > 0$. The Markov network for Example 1 is 
shown in Figure 5(b). 

Markov networks are undirected graphs and do not depend 
on variable orderings. Also, as shown in Figure 5(b), 
Markov networks can have loops. However, note that W 
and X share an edge in the Markov network. Even though 
they are marginally independent, Y depends on both. Thus, 
conditioning on Y induces dependence between W and X. 

The Bayesian network in Example [T] did not have an edge 
between W and X, although some other variable orderings 
would result in such an edge. Although describing the same set 
of random variables, Markov and Bayesian networks represent 
different mutual dependencies. We next consider how to succinctly 
represent how the past of some processes statistically 
affects the future of others in a network of random processes. 
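
The contrast between the two networks for Example 1 can be reproduced with a short computation. The sketch below is an illustration under an added assumption that is not in the text: W, X, and the two noises are taken to be independent, unit-variance Gaussians entering additively (Y driven by W and X, Z driven by W). For jointly Gaussian variables, a zero entry of the precision (inverse covariance) matrix is equivalent to conditional independence of that pair given all other variables, which is exactly the Markov network edge test. The helper names are hypothetical.

```python
from fractions import Fraction

NAMES = ["W", "X", "Y", "Z"]

# Rows express (W, X, Y, Z) in terms of independent unit-variance sources
# (W, X, eps, eps'): Y = W + X + eps and Z = W + eps' (assumed Gaussian).
A = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 0, 1]]


def matmul_transpose(a):
    """Return a @ a^T for a square list-of-lists matrix."""
    n = len(a)
    return [[sum(Fraction(a[i][k]) * a[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


def inverse(m):
    """Invert a square matrix exactly by Gauss-Jordan elimination."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


covariance = matmul_transpose(A)
precision = inverse(covariance)

# A nonzero precision entry means the pair stays dependent after
# conditioning on the other two variables, i.e. a Markov network edge.
markov_edges = [(NAMES[i], NAMES[j])
                for i in range(4) for j in range(i + 1, 4)
                if precision[i][j] != 0]

print("Cov(W, X) =", covariance[0][1])
print("Markov edges:", markov_edges)
```

Under these assumptions W and X have zero covariance, yet the pair (W, X) appears among the Markov edges, matching the observation that conditioning on Y induces dependence between them. The pairs (X, Z) and (Y, Z) have zero precision entries and therefore no edge.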



III. Minimal Generative Model Graphs and 
Directed Information Graphs 

We now consider the specific case of physically causal, 
stochastic dynamical systems. Here, there are multiple processes, 
each of which is an indexed collection of random 
variables, and the indexing corresponds to time. Our goal 
is to graphically represent the structure of causal influences 
between processes. We consider two approaches. 

We first introduce a graphical model based on generative 
models for stochastic dynamical systems. It is motivated by 
the process of reducing dependencies in coupled differen- 
tial equations for deterministic dynamic systems. The other 
graphical model is motivated by the framework of Granger 
causality, which directly tests for causal influence. Causally 
conditioned directed information is shown to be a general 
formulation of Granger causality and used as the edge selec- 
tion criterion. Thus, there is an analogy between the graphical 
models proposed in this section and Bayesian and Markov 
models respectively. Although Bayesian and Markov models 
are different representations of the structure, we will lastly 
show that despite different motivations and methodologies, the 
two proposed graphical models are identical. These graphical 
models were first discussed in preliminary work in [57]. 



A. Minimal Generative Model Graphs 

Stochastic dynamical systems have indexing both across 
processes and across time. They have a natural formula repre- 
sentation - the coupled differential equations that characterize 
the dynamics of the system over time. Consider the following 
example of a simple system: 

Example 2: Let $x_t$, $y_t$, and $z_t$ be three processes comprising 
a physical, dynamical system. The evolution of the processes 
over time can be fully described by coupled differential 
equations: 

$$\dot{x} = f(x, y, z) \tag{5a}$$
$$\dot{y} = g(x, y, z) \tag{5b}$$
$$\dot{z} = h(x, y, z). \tag{5c}$$

For small $\Delta$, this becomes 

$$x_{t+\Delta} = x_t + \Delta f(x_t, y_t, z_t) \tag{6a}$$
$$y_{t+\Delta} = y_t + \Delta g(x_t, y_t, z_t) \tag{6b}$$
$$z_{t+\Delta} = z_t + \Delta h(x_t, y_t, z_t). \tag{6c}$$

Note that the dynamics in this system are causal. Given the 
full past of the system, $x_{t+\Delta}$ can be completely generated 
without knowledge of $y_{t+\Delta}$ or $z_{t+\Delta}$. Thus, although these 
equations are coupled, knowledge of the full past decouples 
the dynamics of the near future. Suppose that the processes 
are not each dependent on the past of all other processes. For 
instance, suppose $x_t$ and $y_t$ influence each other independently 
of $z_t$, and that $z_t$ does not depend on $x_t$: 

$$\dot{x} = f(x, y)$$
$$\dot{y} = g(x, y)$$
$$\dot{z} = h(y, z).$$

For small $\Delta$, this becomes 

$$x_{t+\Delta} = x_t + \Delta f(x_t, y_t)$$
$$y_{t+\Delta} = y_t + \Delta g(x_t, y_t)$$
$$z_{t+\Delta} = z_t + \Delta h(y_t, z_t).$$

In this scenario, we can construct a directed graph where, 
for each node, we place incoming edges from those nodes 
that describe the minimal set of processes whose past affects 
the near future of the process pertaining to the node of interest. 
The physically causal dependencies represented in the reduced 
formulas characterize the structure of this dynamical system. 
See Figure 6. 

Fig. 6. A graphical model of the causal influence structure of the stochastic 
dynamical system of Example 2. 
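
The distinction between direct and indirect influence in the reduced system can be seen by stepping the discretized equations. The following is a minimal sketch, not the paper's implementation: the drift functions `f`, `g`, and `h` are hypothetical linear choices, and `DELTA` is an arbitrary small step. The only structural feature taken from the text is which arguments each drift receives.

```python
DELTA = 0.01  # step size (illustrative choice)


# Hypothetical drift functions; each takes only the processes it depends on.
def f(x, y):
    return -x + 0.5 * y


def g(x, y):
    return 0.5 * x - y


def h(y, z):
    return y - z


def euler_step(state, delta=DELTA):
    """One step of the reduced discretized system for (x, y, z)."""
    x, y, z = state
    return (x + delta * f(x, y),
            y + delta * g(x, y),
            z + delta * h(y, z))


def run(state, steps):
    """Apply euler_step the given number of times."""
    for _ in range(steps):
        state = euler_step(state)
    return state


# Two initial conditions that differ only in x.
a = (1.0, 0.0, 0.0)
b = (2.0, 0.0, 0.0)
print("z after 1 step: ", run(a, 1)[2], run(b, 1)[2])
print("z after 2 steps:", run(a, 2)[2], run(b, 2)[2])
```

After one step the z values agree, since x is not an argument of `h`. After two steps they differ, because x has influenced y in the meantime. The graph records only the first kind of dependence, which is why there is no edge from x to z even though x eventually affects z through y.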

Now consider the system analogous to (5) but where 
$(X, Y, Z) = (X_t, Y_t, Z_t : t \ge 0)$ is a multi-dimensional Ito 
process satisfying the stochastic differential equations [58]: 

$$dX_t = f(X_t, Y_t, Z_t)\,dt + dB_t \tag{8a}$$
$$dY_t = g(X_t, Y_t, Z_t)\,dt + dC_t \tag{8b}$$
$$dZ_t = h(X_t, Y_t, Z_t)\,dt + dD_t \tag{8c}$$

where $(B, C, D) = (B_t, C_t, D_t : t \ge 0)$ are independent 
Wiener processes. An analogous relation holds for multi-dimensional 
jump processes where $(B, C, D) = (B_t, C_t, D_t : t \ge 0)$ 
are independent Poisson processes. 
For sufficiently small $\Delta$, this becomes: 

$$X_{t+\Delta} = X_t + \Delta f(X_t, Y_t, Z_t) + (B_{t+\Delta} - B_t)$$
$$Y_{t+\Delta} = Y_t + \Delta g(X_t, Y_t, Z_t) + (C_{t+\Delta} - C_t)$$
$$Z_{t+\Delta} = Z_t + \Delta h(X_t, Y_t, Z_t) + (D_{t+\Delta} - D_t).$$

Note that due to the independent increments property of 
the Wiener and Poisson processes, $(B_{t+\Delta} - B_t)$ is statistically 
independent of $(X_\tau, Y_\tau, Z_\tau : \tau \le t)$ and is also statistically 
independent of $(C_{t+\Delta} - C_t)$ and $(D_{t+\Delta} - D_t)$. 

Thus, analogous to (6), knowledge of the full past statistically 
decouples the dynamics of the multi-dimensional process 
in the near future. 

If we discretize time into units of length $\Delta$, then the 
stochastic dynamics are fully captured by the joint distribution 
$P_{X^n, Y^n, Z^n}$. Analogous to the coupled differential equations, 
the chain rule can be applied over time to obtain a similar 
representation: 

$$P_{X^n,Y^n,Z^n}(dx^n, dy^n, dz^n) = \prod_{j=1}^{n} P_{X_j,Y_j,Z_j | X^{j-1},Y^{j-1},Z^{j-1}}(dx_j, dy_j, dz_j \,|\, x^{j-1}, y^{j-1}, z^{j-1}). \tag{9}$$

Due to the dynamics of the multi-dimensional process in the 
near future being statistically decoupled given knowledge of 
the full past, we have: 

$$P_{X^n,Y^n,Z^n}(dx^n, dy^n, dz^n) = \prod_{j=1}^{n} P_{X_j | X^{j-1},Y^{j-1},Z^{j-1}}(dx_j \,|\, x^{j-1}, y^{j-1}, z^{j-1})$$
$$\qquad \times\, P_{Y_j | X^{j-1},Y^{j-1},Z^{j-1}}(dy_j \,|\, x^{j-1}, y^{j-1}, z^{j-1})$$
$$\qquad \times\, P_{Z_j | X^{j-1},Y^{j-1},Z^{j-1}}(dz_j \,|\, x^{j-1}, y^{j-1}, z^{j-1}). \tag{10}$$

Now suppose we have a multi-dimensional stochastic differential 
equation of the form: 

$$dX_t = f(X_t, Y_t)\,dt + dB_t$$
$$dY_t = g(X_t, Y_t)\,dt + dC_t$$
$$dZ_t = h(Y_t, Z_t)\,dt + dD_t.$$

For sufficiently small $\Delta$, this becomes: 

$$X_{t+\Delta} = X_t + \Delta f(X_t, Y_t) + (B_{t+\Delta} - B_t)$$
$$Y_{t+\Delta} = Y_t + \Delta g(X_t, Y_t) + (C_{t+\Delta} - C_t)$$
$$Z_{t+\Delta} = Z_t + \Delta h(Y_t, Z_t) + (D_{t+\Delta} - D_t).$$

With this, we can further remove the unnecessary dependencies 
in the discrete-time model to obtain: 

$$P_{X^n,Y^n,Z^n}(dx^n, dy^n, dz^n) = \prod_{j=1}^{n} P_{X_j | X^{j-1},Y^{j-1}}(dx_j \,|\, x^{j-1}, y^{j-1})$$
$$\qquad \times\, P_{Y_j | X^{j-1},Y^{j-1}}(dy_j \,|\, x^{j-1}, y^{j-1})$$
$$\qquad \times\, P_{Z_j | Y^{j-1},Z^{j-1}}(dz_j \,|\, y^{j-1}, z^{j-1}). \tag{12}$$

This causal dependence structure is shown graphically in 
Figure 6. 
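
The reduced factorization can be read as a sampling recipe: at each step, every process is generated from its own parents' past plus an independent increment. The sketch below illustrates this with an Euler-Maruyama style discretization. It is an assumption-laden illustration rather than the paper's procedure: the linear drifts in `DRIFT`, the step size, and the zero initial state are hypothetical, and the Wiener increments are drawn as independent Gaussians with variance `delta`. The `PARENTS` table encodes only the dependence structure stated in the text.

```python
import math
import random

# Parent sets read off the reduced factorization: X and Y drive each other,
# and Z is driven by Y. Each process also depends on its own past.
PARENTS = {"X": ("X", "Y"), "Y": ("X", "Y"), "Z": ("Y", "Z")}

# Hypothetical linear drifts; each sees only the parents of its process.
DRIFT = {
    "X": lambda p: -p["X"] + 0.5 * p["Y"],
    "Y": lambda p: 0.5 * p["X"] - p["Y"],
    "Z": lambda p: p["Y"] - p["Z"],
}


def simulate(n_steps, delta=0.01, seed=0):
    """Sample a path by generating each process from its parents' past."""
    rng = random.Random(seed)
    state = {"X": 0.0, "Y": 0.0, "Z": 0.0}
    path = [dict(state)]
    for _ in range(n_steps):
        nxt = {}
        for name, parents in PARENTS.items():
            visible = {q: state[q] for q in parents}  # parents' past only
            increment = rng.gauss(0.0, math.sqrt(delta))  # independent noise
            nxt[name] = state[name] + delta * DRIFT[name](visible) + increment
        state = nxt
        path.append(dict(state))
    return path


def graph_edges(parents):
    """Directed edges (parent, child), omitting self-dependence."""
    return sorted((q, name) for name, ps in parents.items()
                  for q in ps if q != name)


path = simulate(100)
print("edges:", graph_edges(PARENTS))
print("final state:", path[-1])
```

Because each drift receives only the `visible` dictionary, a drift that tried to read a non-parent would raise a `KeyError`. In that sense the parent table and the generative model cannot disagree, and the edge list derived from it has the three directed edges X → Y, Y → X, and Y → Z.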

We can extend this stochastic differential equation relation 
to the following lemma: 

Lemma 3.1: Let $B$ be an $m$-dimensional Brownian motion and denote $\mathcal{F}_t$ as the sigma-algebra generated by $\{B_\tau : \tau < t\}$. Assume that $u = [u_j(t, \omega) : t \in [0, T],\ 1 \le j \le m]$ and $v = [v_{ij}(t, \omega) : t \in [0, T],\ 1 \le i, j \le m]$ have the property that $u_t$ and $v_t$ are $\mathcal{F}_t$-measurable for all $t \in [0, T]$. Assume that $v_{ij}(t, \omega) = 0$ for $i \ne j$. Then the $m$-dimensional Itô process $X = [X_j(t, \omega) : t \in [0, T],\ 1 \le j \le m]$ satisfying the stochastic differential equation of the form

$$dX_t = u\,dt + v\,dB_t$$

has the conditional independence property

$$P(X_t \in A_1 \times \cdots \times A_m \mid \mathcal{F}_t) = \prod_{i=1}^{m} P(X_{i,t} \in A_i \mid \mathcal{F}_t).$$

Proof: Note that the integral equation becomes [58, Ch. 4]

$$X_t = X_0 + \int_0^t u(s, \omega)\,ds + \int_0^t v(s, \omega)\,dB_s.$$

Also note that we can pick the left endpoint in the Itô definition [58, Ch. 4] of $\int v(s, \omega)\,dB_s$, so the latter becomes

$$\lim_{n \to \infty} \sum_{t_i \in \mathcal{P}_n} v_{t_{i-1}} \left(B_{t_i} - B_{t_{i-1}}\right)$$

where $\{\mathcal{P}_n\}$ is a sequence of partitions of $[0, T]$. The proof then follows directly from the assumed diagonal structure of $v$, the fact that an $m$-dimensional Wiener process is a set of $m$ independent Wiener processes, and the independent increments property of the Wiener process. ■

Definition 3.2: A distribution $P_X$ is called positive if there exists a measure $\phi$ such that $P_X \ll \phi$ and $\frac{dP_X}{d\phi}(x) > 0$ for all $x$ in the support of $P_X$.

Assumption 1: For the remainder of this paper, we only consider joint distributions that are positive and that satisfy (10).
Remark 1: The assumption of positivity is to avoid degenerate cases that arise with deterministic relationships. For instance, suppose $X$ is a random process with a continuous distribution and $Y$ represents $X$ passed through a deterministic invertible system. Then $X$ and $Y$ have a joint distribution that is not absolutely continuous with respect to the Lebesgue measure, because the distribution of $Y$ given $X$ is a point mass and vice versa. The assumption (10) is implicit in Marko's Kirchhoff current laws pertaining to directed information [20]. Specifically, the assumption (10) must hold in order to equate [20, eqn. 14] with [20, eqn. 13]. Moreover, the assumption (10) holds for any continuous-time generative model of a stochastic differential equation satisfying (5) where $B$, $C$, $D$ are Wiener or Poisson processes. As such, these are rather modest assumptions that describe models of many important physical systems.

We now generalize this process. Consider a causal, stochastic dynamical system of $m$ processes with a positive joint distribution. Let $\underline{X}$ denote the set of random processes and let $P_{\underline{X}}$ denote their joint distribution. The dynamics of the system are fully described by $P_{\underline{X}}$, which could initially be factorized over space (the index of the processes) or over time. As in (9), we first apply the chain rule over time:

$$P_{\underline{X}}(d\underline{x}) = \prod_{t=1}^{n} P_{\underline{X}_t \mid \underline{X}^{t-1}}(d\underline{x}_t \mid \underline{x}^{t-1}).$$

With Assumption 1, given the full past, the dynamics of the processes in the near future decouple, analogous to (10):

$$P_{\underline{X}}(d\underline{x}) = \prod_{t=1}^{n} \prod_{i=1}^{m} P_{X_{i,t} \mid \underline{X}^{t-1}}(dx_{i,t} \mid \underline{x}^{t-1}). \tag{13}$$

Equivalently, using causal conditioning notation (2):

$$P_{\underline{X}}(d\underline{x}) = \prod_{i=1}^{m} P_{X_i \| \underline{X}_{[m]\setminus\{i\}}}(dx_i \,\|\, \underline{x}_{[m]\setminus\{i\}}).$$

By factorizing over time first, each $X_i$ is still conditioned on the past of every other process. As in (12), we now remove unnecessary causal dependencies. For each process $X_i$, let $A(i) \subseteq [m]\setminus\{i\}$ denote a subset of other processes. Define the corresponding induced probability measure $P_A$:

$$P_A(d\underline{x}) = \prod_{i=1}^{m} P_{X_i \| \underline{X}_{A(i)}}(dx_i \,\|\, \underline{x}_{A(i)}). \tag{14}$$

We want to pick the subsets $\{A(i)\}_{i=1}^{m}$ so that their cardinalities are small, while still capturing the full dynamics:

$$D(P_{\underline{X}} \,\|\, P_A) = 0. \tag{15}$$

In Example 2, the $A(i)$'s would correspond to $\{Y\}$, $\{X\}$, and $\{Y\}$ for $X$, $Y$, and $Z$, respectively.

Definition 3.3: Under Assumption 1, for a joint distribution $P_{\underline{X}}$, a minimal generative model is a function $A : [m] \to 2^{[m]}$ where the cardinalities $|A(i)|$ are minimal such that (15) holds.

Note that the $A(i)$'s together must satisfy (15), suggesting that the choice of $A(i)$ for a particular $i$ might be restricted by what was already chosen for $\{A(j)\}_{j=1}^{i-1}$. However, by non-negativity of KL divergence, (15) corresponds to

$$D\left(P_{X_i \| \underline{X}_{[m]\setminus\{i\}}} \,\middle\|\, P_{X_i \| \underline{X}_{A(i)}} \,\middle|\, P_{\underline{X}_{[m]\setminus\{i\}}}\right) = 0 \tag{16}$$

for all $i \in [m]$. Thus, the sets can be chosen separately to satisfy this condition. Furthermore, under our assumption, minimum generative models are unique.

Lemma 3.4: For any distribution $P_{\underline{X}}$ satisfying Assumption 1, the minimal generative model is unique.

The proof appears in Appendix A. Bayesian networks depend on the indexing of the variables. For minimum generative models, however, by first factorizing over time, the indexing of the processes becomes irrelevant. We now define a corresponding graphical model.

Definition 3.5: A minimum generative model graph is a directed graph for a minimum generative model, where each process is represented by a node, and there is a directed edge from $X_k$ to $X_i$ for $i, k \in [m]$ iff $k \in A(i)$.
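As a small illustration of Definition 3.5, the sketch below builds the edge set from the parent sets $A(i)$ stated for Example 2 ($\{Y\}$, $\{X\}$, $\{Y\}$ for $X$, $Y$, $Z$) and checks for a directed loop. The function names `mgm_graph_edges` and `has_directed_loop` are hypothetical helpers introduced here, not part of the paper.

```python
def mgm_graph_edges(parents):
    """Return the directed edges (k, i) with k in A(i)."""
    return {(k, i) for i, a_i in parents.items() for k in a_i}


def has_directed_loop(nodes, edges):
    """Depth-first search for a directed cycle."""
    children = {v: [] for v in nodes}
    for k, i in edges:
        children[k].append(i)
    state = {v: 0 for v in nodes}  # 0 = unvisited, 1 = on stack, 2 = done

    def visit(v):
        state[v] = 1
        for w in children[v]:
            if state[w] == 1 or (state[w] == 0 and visit(w)):
                return True
        state[v] = 2
        return False

    return any(state[v] == 0 and visit(v) for v in nodes)


# Parent sets A(i) for Example 2, as given in the text.
parents = {"X": {"Y"}, "Y": {"X"}, "Z": {"Y"}}
edges = mgm_graph_edges(parents)
loop = has_directed_loop(list(parents), edges)
```

Here the edges $X \to Y$ and $Y \to X$ form a directed loop, which a Bayesian network over the same nodes could not contain.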

Note that unlike Bayesian networks, minimum generative model graphs can have directed loops, as is the case in Figure 6.

Minimum generative model graphs represent reduced fac- 
torizations of the joint distribution of the system. They encode 
causal relationships by only depicting as parents in the graph 
those subsets of processes that are necessary and sufficient 
to describe the full dynamics. This graphical model was motivated by reducing coupled differential equations for deterministic systems. We next propose an alternative graphical model based on the framework of Granger causality, which directly tests for causal influences between processes. We show that causally conditioned directed information captures Granger's principle and use it as an edge selection criterion.

B. Granger Causality and Directed Information Graphs 

In 1969, motivated by earlier work by Norbert Wiener [23], Nobel laureate Clive Granger proposed a framework for identifying when one process statistically "causes" another [22]:

"We say that X is causing Y if we are better able to predict [the future of] Y using all available information than if the information apart from [the past of] X had been used."

While this definition is general, its previous formulations have been restricted to specific classes of models, such as autoregressive linear models. Specifically, Granger's setup was as follows [22]. Suppose we have three processes $X$, $Y$, and $Z$. Granger posited the development of two autoregressive linear models:

$$Y_t = \sum_{\tau > 0} a_\tau Y_{t-\tau} + \sum_{\tau' > 0} \left( b_{\tau'} X_{t-\tau'} + c_{\tau'} Z_{t-\tau'} \right) + E_t \tag{17}$$

$$Y_t = \sum_{\tau > 0} \tilde{a}_\tau Y_{t-\tau} + \sum_{\tau' > 0} \tilde{c}_{\tau'} Z_{t-\tau'} + \tilde{E}_t. \tag{18}$$

Next, he proposed using least squares so that each predictor performs rationally, finding the best coefficients $(a, b, c)$ in (17) and analogously $(\tilde{a}, \tilde{c})$ in (18). The performance of the predictors is measured by the variability in $E_t$ and $\tilde{E}_t$, given by the variances $\sigma^2$ and $\tilde{\sigma}^2$. Note that in general, after the least squares fit, $\tilde{\sigma}^2 \ge \sigma^2$. In the event that $\tilde{\sigma}^2 = \sigma^2$, removing the past of $X$ had no detrimental effect on predicting the future of $Y$. As such, Granger suggested the calculation of the following quantity:



and argued that $X$ causes $Y$ if and only if it is deemed positive (in a statistically significant sense). In many scenarios, the modality of the process $Y$ is not even continuous-valued and thus modeling equations like (17) and (18) are not applicable. For example, suppose we would like to understand the causal influence between neural spike trains, or the causal influence between the timings of packets in a computer network [34]. If we discretize time into small enough bins, then $Y_i$ is 0 or 1, depending on whether or not an event occurred in that bin. However, the right-hand sides of (17) and (18) are continuous-valued.
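The comparison of the two autoregressive fits can be sketched numerically. The example below is a minimal illustration under several assumptions that are not from the paper: only lag 1 is used, there is no intercept, the data are simulated from a hypothetical linear model in which $X$ truly drives $Y$, and the log variance ratio `granger_stat` is one conventional way to summarize the comparison. The helpers `solve` and `residual_variance` are hypothetical names; least squares is done with plain Gaussian elimination to keep the sketch dependency-free.

```python
import math
import random


def solve(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= factor * m[c][j]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][j] * coef[j] for j in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def residual_variance(rows, targets):
    """Least-squares fit of targets on rows; return the mean squared residual."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    coef = solve(xtx, xty)
    res = [t - sum(c * v for c, v in zip(coef, r)) for r, t in zip(rows, targets)]
    return sum(e * e for e in res) / len(res)


# Simulated data (hypothetical model): Y is driven by its own past and by X.
rng = random.Random(1)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 1))

targets = y[1:]
full = [[y[t - 1], x[t - 1], z[t - 1]] for t in range(1, n)]   # lag-1 form of (17)
restricted = [[y[t - 1], z[t - 1]] for t in range(1, n)]        # lag-1 form of (18)
var_full = residual_variance(full, targets)
var_restricted = residual_variance(restricted, targets)
granger_stat = math.log(var_restricted / var_full)  # assumed summary statistic
```

Since the restricted model is nested in the full one, its in-sample residual variance cannot be smaller, so the statistic is non-negative; deciding whether it is significantly positive would need a statistical test that this sketch omits.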

1) A general sequential prediction approach: We will consider a general formulation of Granger's statement, emphasizing the words "predict" and "better". Granger's definition is posed in terms of how one probability model predicts better than another. Let $\underline{X}$ be all the processes simultaneously being observed. Let $Y = X_i$ denote the stochastic process being predicted. Let $X = X_k$ be another process. The setup is that we are attempting to understand if $X$ causes $Y$. Knowledge of the past of $X$ is a form of causal side information. Thus, the overall goal is to characterize how much causal side information helps in sequential prediction. We now formally describe a general sequential prediction setup. (See [59] for an overview.)

Denote $\mathcal{F}_t$ to be the sigma-algebra pertaining to information about the past of all processes, and $\tilde{\mathcal{F}}_t$ to be the sigma-algebra pertaining to information about the past of all processes excluding $X$:

$$\begin{aligned}
\mathcal{F}_t &= \sigma\left(X_{l,\tau} : l \in [m]\setminus\{k\},\ \tau < t;\ X_{k,\tau} : \tau < t\right) \\
\tilde{\mathcal{F}}_t &= \sigma\left(X_{l,\tau} : l \in [m]\setminus\{k\},\ \tau < t\right).
\end{aligned}$$

Remark 2: Note that if $[m] = \{1, 2\}$ and we denote $X = X_1$ and $Y = X_2$, then this reduces to

$$\mathcal{F}_t = \sigma(X^{t-1}, Y^{t-1}), \qquad \tilde{\mathcal{F}}_t = \sigma(Y^{t-1}).$$

At time $t$, one predictor has available information about the past of all processes, and specifies a prediction $q_t \in Q$ about $y_t$ that is $\mathcal{F}_t$-measurable. The other predictor has all available information apart from the past of $X$ and specifies a prediction $\tilde{q}_t \in Q$ about $y_t$ that is $\tilde{\mathcal{F}}_t$-measurable. Define the spaces of candidate predictors as:

$$\begin{aligned}
\mathcal{A}_t &= \{q : \Omega \to Q \ \text{s.t.}\ q \text{ is } \mathcal{F}_t\text{-measurable}\} \\
\tilde{\mathcal{A}}_t &= \{q : \Omega \to Q \ \text{s.t.}\ q \text{ is } \tilde{\mathcal{F}}_t\text{-measurable}\}.
\end{aligned}$$

Subsequently, $y_t$ is revealed, and a loss function $l : \mathsf{Y} \times Q \to \mathbb{R}_+$ dictates that for prediction $p \in Q$ and outcome $y_t$, a loss of $l(p, y_t)$ is incurred. Thus, one predictor incurs loss $l(q_t, y_t)$ and the other incurs $l(\tilde{q}_t, y_t)$.

Denote the reduction in loss, which characterizes whether or not the former prediction was better than the latter, as:

$$r_t = l(\tilde{q}_t, y_t) - l(q_t, y_t).$$

Analogous to how Granger selected least squares fits for the coefficients in (17) and (18), it is quite natural to select rational agents to minimize loss with respect to the two scenarios, in expectation or in a minimax sense. Afterwards, analogously, the reduction in loss can be discussed as a measure of performance in a worst-case or average sense. We have discussed this in preliminary work in [60]. We will next discuss the logarithmic loss and expected cumulative reduction in loss.

2) The logarithmic loss: Now we consider the framework of the logarithmic loss. Let $\mathsf{Y}$ be a measurable space and let $\mu$ be a measure on that space. Let the space of predictors be the space of probability measures over $\mathsf{Y}$ that form a density with respect to $\mu$:

$$Q = \{p \in \mathcal{P}(\mathsf{Y}) : p \ll \mu\}.$$

With this, we define the logarithmic loss as:

$$l(q, y) = -\log \frac{dq}{d\mu}(y).$$

Example 3: Suppose that $\mathsf{Y} \subset \mathbb{Z}$ and we let $\mu$ be the counting measure. Then we have that $l(q, y)$ is simply $-\log q(\{y\})$, where we note that $q(\{y\})$ is the probability mass function evaluated at outcome $y$.

We assume that there is a joint distribution $P_{\underline{X}}$ on all the random processes and that both agents act rationally in selecting their predictor to minimize loss on average. As such, one agent attempts to minimize his expected loss with respect to $P_{\underline{X}}$:

$$q_t^* = \arg\min_{q_t \in \mathcal{A}_t} E_{P_{\underline{X}}}\left[l(q_t, Y_t)\right] \tag{19}$$

and the other agent does likewise:

$$\tilde{q}_t^* = \arg\min_{\tilde{q}_t \in \tilde{\mathcal{A}}_t} E_{P_{\underline{X}}}\left[l(\tilde{q}_t, Y_t)\right]. \tag{20}$$

The expected cumulative reduction in loss for both rational agents is given by:

$$R(q^*, \tilde{q}^*) = E_{P_{\underline{X}}}\left[\sum_{t=1}^{n} r_t(q_t^*, \tilde{q}_t^*, Y_t)\right].$$



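We now state our main lemma, showing that the optimal predictors are the true conditional distributions and that the reduction in expected loss is precisely the causally conditioned directed information.

Lemma 3.6: The optimal solutions to (19) and (20) are given by

$$q_t^* = P_{Y_t \mid \mathcal{F}_t}, \qquad \tilde{q}_t^* = P_{Y_t \mid \tilde{\mathcal{F}}_t}.$$

The expected cumulative reduction in loss is given by the causally conditioned directed information:

$$R(q^*, \tilde{q}^*) = I(X \to Y \,\|\, \underline{X}_{[m]\setminus\{i,k\}}).$$

Proof: Note that

$$\begin{aligned}
q_t^* &= \arg\min_{q_t \in \mathcal{A}_t} E_{P_{\underline{X}}}\left[-\log \frac{dq_t}{d\mu}(Y_t)\right] \\
&= \arg\min_{q_t \in \mathcal{A}_t} E_{P_{\underline{X}}}\left[\log \frac{d\mu}{dP_{Y_t \mid \mathcal{F}_t}}(Y_t) + \log \frac{dP_{Y_t \mid \mathcal{F}_t}}{dq_t}(Y_t)\right] \\
&= \arg\min_{q_t \in \mathcal{A}_t} D\left(P_{Y_t \mid \mathcal{F}_t} \,\|\, q_t\right)
\end{aligned} \tag{21}$$

where (21) follows from the definition of divergence and from the fact that the left-hand term in the expectation does not affect the argmin. Moreover, note that $P_{Y_t \mid \mathcal{F}_t}$ is clearly $\mathcal{F}_t$-measurable, and thus from the non-negativity of KL divergence, $q_t^* = P_{Y_t \mid \mathcal{F}_t}$. Similarly, $\tilde{q}_t^* = P_{Y_t \mid \tilde{\mathcal{F}}_t}$.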

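We now discuss using Granger's notion of "better" to address the two predictors. Since clearly $q_t^* \ll \tilde{q}_t^*$, note that the reference measure $\mu$ disappears and the reduction in loss becomes a log-likelihood ratio:

$$r_t(q_t^*, \tilde{q}_t^*, y_t) = \log \frac{dP_{Y_t \mid \mathcal{F}_t}}{dP_{Y_t \mid \tilde{\mathcal{F}}_t}}(y_t).$$

Thus,

$$R(q^*, \tilde{q}^*) = E_{P_{\underline{X}}}\left[\sum_{t=1}^{n} \log \frac{dP_{Y_t \mid \mathcal{F}_t}}{dP_{Y_t \mid \tilde{\mathcal{F}}_t}}(Y_t)\right] = I(X \to Y \,\|\, \underline{X}_{[m]\setminus\{i,k\}}).$$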

Note that this embodies the notion of "better" because of the non-negativity of directed information. As such, a natural way to generalize Granger's notion of causality is to say that $X$ causes $Y$ if and only if $I(X \to Y \,\|\, \underline{X}_{[m]\setminus\{i,k\}}) > 0$.
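The link between reduction in logarithmic loss and an information quantity can be checked on a toy case. The sketch below is a single-time-step illustration under simplifying assumptions that are not from the paper: there are only two processes, $Y$ does not depend on its own past, and the joint pmf `joint` is a hypothetical example. In this special case the expected reduction in log loss between the two optimal predictors equals the mutual information between the past of $X$ and the next value of $Y$.

```python
import math

# Hypothetical joint pmf of (past value of X, next value of Y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}


def expected_log_loss(predictor):
    """Expected value of -log2 q(y), where the pmf q may depend on x."""
    return sum(p * -math.log2(predictor(x)[y]) for (x, y), p in joint.items())


# Optimal predictor with side information: the conditional pmf P(Y | X = x).
with_side_info = lambda x: {y: joint[(x, y)] / p_x[x] for y in (0, 1)}
# Optimal predictor without side information: the marginal pmf P(Y).
without_side_info = lambda x: p_y

reduction = expected_log_loss(without_side_info) - expected_log_loss(with_side_info)
mutual_info = sum(
    p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items()
)
```

The two quantities agree up to floating-point error, and the reduction is non-negative, which mirrors the role that non-negativity of directed information plays in the general statement.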

Remark 3: Note that the formulation of Granger causality for linear autoregressive models is equivalent to directed information when $P_{\underline{X}}$ is jointly Gaussian [52], [50]. We note that while directed information does capture Granger causality in a general setting (expected cumulative regret under the logarithmic loss of the best possible sequential predictors), it does not for all contexts. For instance, if the worst-case regret over outcomes is used, then the resulting value of causal side information is different from directed information.

Lemma 3.6 shows that directed information may also be used to quantify how much causal knowledge of process $X$ helps in sequentially predicting process $Y$, in the sense of Granger. We now define a graphical model, using directed information, that represents the causal structure.

Definition 3.7: For a set of random processes $\underline{X}$, the directed information graph is a directed graph where each node represents a process and there is a directed edge from process $X = X_k$ to process $Y = X_i$ (for $i, k \in [m]$) iff

$$I(X \to Y \,\|\, \underline{X}_{[m]\setminus\{i,k\}}) > 0.$$



In this model, there is a directed edge from X to Y if 
and only if causal knowledge of X still influences Y, even 
with causal knowledge of all other processes. Since edges are 
found separately, directed information graphs are unique. Also, 
directed loops are possible. 
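To make the edge criterion concrete for the simplest case, the following sketch estimates a per-time-step directed information between two binary sequences. It rests on assumptions that are not from the paper: there are no other processes to condition on, the pair is stationary and first-order Markov (so each term reduces to $I(Y_t; X_{t-1} \mid Y_{t-1})$), and a plug-in estimate from empirical counts is adequate. The function name `di_rate_order1` and the simulated data are hypothetical.

```python
import math
import random
from collections import Counter


def di_rate_order1(x, y):
    """Plug-in estimate (in bits) of I(Y_t; X_{t-1} | Y_{t-1})."""
    triples = Counter((y[t], x[t - 1], y[t - 1]) for t in range(1, len(y)))
    n = sum(triples.values())
    c_xy_past = Counter()   # counts of (x_{t-1}, y_{t-1})
    c_y_ypast = Counter()   # counts of (y_t, y_{t-1})
    c_ypast = Counter()     # counts of y_{t-1}
    for (yt, xp, yp), c in triples.items():
        c_xy_past[(xp, yp)] += c
        c_y_ypast[(yt, yp)] += c
        c_ypast[yp] += c
    total = 0.0
    for (yt, xp, yp), c in triples.items():
        # Ratio of p(y_t | x_{t-1}, y_{t-1}) to p(y_t | y_{t-1}).
        ratio = (c / c_xy_past[(xp, yp)]) / (c_y_ypast[(yt, yp)] / c_ypast[yp])
        total += (c / n) * math.log2(ratio)
    return total


# Simulated data: Y copies the previous value of X, flipped with probability 0.1.
rng = random.Random(0)
n = 5000
x = [rng.randint(0, 1) for _ in range(n)]
y = [0] + [x[t - 1] ^ (rng.random() < 0.1) for t in range(1, n)]

x_to_y = di_rate_order1(x, y)  # influence of X's past on Y
y_to_x = di_rate_order1(y, x)  # influence of Y's past on X
```

In this simulation the estimate from $X$ to $Y$ is clearly positive while the reverse estimate is close to zero, so only the edge $X \to Y$ would be drawn; with real data, a threshold or significance test would be needed in place of the strict comparison with zero.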

Remark 4: Under Assumption 1, since Granger causality and directed information are equivalent in the case of jointly Gaussian processes [51], [52], Granger causality graphs and directed information graphs are the same in that setting.

Minimum generative model graphs and directed information 
graphs are alternative graphical models to understand the 
relationship between the past of some processes and the future 
of others in stochastic dynamical systems. We now investigate 
in what ways these perspectives are related. 

Theorem 3.8: For any joint distribution $P_{\underline{X}}$ satisfying Assumption 1, the corresponding minimal generative model graph and directed information graph are equivalent.

The proof is in Appendix B. Markov and Bayesian networks are different graphs, showing different aspects of the structure. That directed information graphs and minimum generative model graphs are the same suggests that they show fundamental causal dependence structures of networks of processes.



IV. Structural Properties of Directed 
Information Graphs 

The construction methodologies of minimum generative 
model graphs and directed information graphs are analogous 
to those of Bayesian and Markov networks, respectively. In 
this section we describe some analogous properties between 
the graphical models and investigate relationships between 
them. We first describe how, analogous to Markov chains, the 
graphical structure of directed information graphs embodies 
causal Markov chains. We then discuss how the variable 
dependence structure induced by directed information graphs 
is the unique dynamic Bayesian network. We also examine 
causal I-map properties of directed information graphs. 

A. Causal Markov chains 

In Markov networks, there is an important relationship between a node and its immediate neighbors. Consider a variable $Y$ and its neighbor set $A$, which is called the Markov boundary. Any subset $B \subseteq V\setminus\{Y\}$ of variables containing $A$ is called a Markov blanket. In Figure 5(b), $\{W\}$ is the Markov boundary for $Z$. $\{W\}$, $\{W, X\}$, $\{W, Y\}$ and $\{W, Y, X\}$ are the possible Markov blankets.

For each of its neighbors $X \in A$, $I(X; Y \mid V\setminus\{X, Y\}) > 0$ holds by construction. This is a pairwise, global test. Furthermore, let $W$ be any subset of $V\setminus\{Y\}$. The Markov chain $W - B - Y$ also holds and

$$I(W; Y) \le I(B; Y)$$

with equality iff $A \subseteq W$. This follows from the data-processing inequality [61].

There is an analogous phenomenon for minimum generative model graphs. In constructing a minimum generative model (Definition 3.5), for each process $X_i$, a unique, minimal set $A(i)$ is found that renders $X_i$ causally conditionally independent of all other processes. Note that (16) is equivalent to

$$I(\underline{X}_{[m]\setminus\{i\}\setminus A(i)} \to X_i \,\|\, \underline{X}_{A(i)}) = 0. \tag{22}$$

The set $A(i)$ indexes the parents of $X_i$ in the graph. We will call $\underline{X}_{A(i)}$ the causal Markov boundary of $X_i$. The causal Markov boundary contains all of the causal influence pertaining to $X_i$, and conditioning on the whole network does not provide any more information. For any subset $B(i) \subseteq [m]\setminus\{i\}$ containing $A(i)$, we call $\underline{X}_{B(i)}$ a causal Markov blanket. Analogous to Markov blankets, causal Markov blankets form causal Markov chains.

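Lemma 4.1: Let $P_{\underline{X}}$ be a distribution satisfying Assumption 1, $X_i$ a process and $\underline{X}_{A(i)}$ its causal Markov boundary. For any subset $B(i)$ with $A(i) \subseteq B(i) \subseteq [m]\setminus\{i\}$ and any subset $W(i) \subseteq [m]\setminus\{i\}$, the causal Markov chain $\underline{X}_{W(i)} \to \underline{X}_{B(i)} \to X_i$ holds and

$$I(\underline{X}_{W(i)} \to X_i) \le I(\underline{X}_{B(i)} \to X_i), \tag{23}$$

with equality iff $A(i) \subseteq W(i)$. The proof is in Appendix C.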

By Theorem 3.8, minimum generative model graphs are equivalent to directed information graphs, so for each process $X_i$ the set of parent processes is the same in both. It follows that the method of directed information graphs separately finds each member of the causal Markov boundaries.

B. Variable dependence structure of directed information graphs



In Section III-A we developed minimum generative models by factorizing the joint distribution over time. By Assumption 1, the next step of the processes became decoupled conditioned on the full past, analogous to the decoupled differential equations in the deterministic case (13). By further removing unnecessary dependencies on other processes, we reduced the factorization (14). That reduction is sufficient for characterizing the unique minimum generative model graph, which elucidates the temporal dependency structure relating the past of some processes and the future of others. We can also examine the dependency structure of the underlying variables, resulting from this factorization and reduction.

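Consider the stochastic dynamical system in Example 2 and Figure 6. Equation (12) shows the reduced factorization of the system. The reduction was only over the processes conditioned on. Further reductions are often possible. For instance, if the system were Markov of order one, so that the state of the system at each time $j + 1$ only depended on the state at the previous time $j$, then (12) can be further reduced to

$$\begin{aligned}
P_{X^n, Y^n, Z^n}(dx^n, dy^n, dz^n) = \prod_{j=1}^{n} \; & P_{X_j \mid X_{j-1}, Y_{j-1}}(dx_j \mid x_{j-1}, y_{j-1}) \\
\times \; & P_{Y_j \mid X_{j-1}, Y_{j-1}}(dy_j \mid x_{j-1}, y_{j-1}) \\
\times \; & P_{Z_j \mid Y_{j-1}, Z_{j-1}}(dz_j \mid y_{j-1}, z_{j-1}).
\end{aligned} \tag{24}$$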



Although this is a significant reduction in variable dependences, the causal dependence structure between processes is still represented by Figure 6. The variable dependence structure is shown in Figure 7 without any reduction, with removal of unnecessary processes, and with removal of all unnecessary variables. By removing all unnecessary variables, we obtain a Bayesian network.
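The passage from a process-level graph to a variable-level network can be illustrated by unrolling the order-one case over time. This is a minimal sketch, assuming the first-order Markov reduction and the Example 2 parent sets; the function name `unroll_order_one` and the tuple encoding `(process, time)` for variables are hypothetical choices made here.

```python
def unroll_order_one(parents, n):
    """Variable-level edges ((k, j - 1), (i, j)) for times j = 1, ..., n - 1.

    Each variable X_{i,j} depends on its own previous value and on the
    previous values of its parent processes A(i).
    """
    edges = []
    for j in range(1, n):
        for i, a_i in parents.items():
            for k in sorted(a_i | {i}):  # own past plus parent processes
                edges.append(((k, j - 1), (i, j)))
    return edges


# Parent sets A(i) for Example 2, as given in the text.
parents = {"X": {"Y"}, "Y": {"X"}, "Z": {"Y"}}
edges = unroll_order_one(parents, n=4)

# Every edge points strictly forward in time, so the unrolled graph is acyclic
# even though the process-level graph has a directed loop between X and Y.
forward_in_time = all(src[1] < dst[1] for src, dst in edges)
```

The time ordering supplies the variable ordering automatically, which is why the unrolled structure does not depend on how the processes themselves are indexed.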

Lemma 4.2: For any distribution $P_{\underline{X}}$ satisfying Assumption 1, if each term in the factorization (14) for the minimum generative model is fully reduced, the resulting structure of underlying variable dependences is the unique Bayesian network with causal ordering. The proof is in Appendix D.

Since there is a unique causal ordering for any joint distribution satisfying Assumption 1, once the distribution is factorized over time (13), the unnecessary variables can be removed in any order to obtain the Bayesian network of the underlying variable structure. That is, whole processes need not be removed from causal conditioning at once.

C. Causal I-map properties of directed information graphs 

The edges of a Markov network are defined by global conditional dependences, where the conditioning is on the rest of the network. Additionally, local conditional dependences, such as conditioning on only a small subset of other nodes, can be identified through simple graphical separation. For instance, for Figure 5(b), while $I(Y; Z \mid X, W) = 0$ is known from construction, since the only path between $Y$ and $Z$ goes through $W$, we can further conclude that $I(Y; Z \mid W) = 0$ by graph separation.



Let $U$, $W$, and $Z$ be three disjoint subsets of $V$ in an undirected graph. If every path between a node in $U$ and a node in $W$ contains a node in $Z$, then this is denoted as $\langle U \mid Z \mid W \rangle$. For example, in Figure 5(b), $\langle Z \mid W \mid Y \rangle$ holds while $\langle X \mid W \mid Y \rangle$ does not. An undirected graph is an independency map (I-map) if, for all disjoint subsets $U$, $W$, and $Z$,

$$\langle U \mid Z \mid W \rangle \implies U \perp W \mid Z.$$



That is, each local or global graphical separation corresponds to a statistical independence relationship. Note that a complete graph, which has no graphical separation, is a trivial I-map. A graph is called a perfect map if $\impliedby$ also holds.

Markov networks are unique, minimal I-maps [42]. That is, removing any edge renders it not an I-map. Not all systems have a perfect map. For Example 1, $W \perp X$, although they share an edge in the Markov network (Figure 5(b)).

Bayesian networks are also minimal I-maps [42], though not unique, as they depend on the variable ordering. Also, they have a different graphical separation criterion known as d-separation. $Z$ d-separates $U$ from $W$, denoted as $\langle U \mid Z \mid W \rangle_d$, if for every path (not necessarily directed) between a node in $U$ and a node in $W$, there is a node $v$ such that either

1) $v$ has both edges directed inward and neither $v$ nor any of $v$'s descendants are in $Z$, or

2) $v$ does not have both edges directed inward and $v \in Z$.

For example, in Figure 5(a), $\langle Z \mid W \mid Y \rangle_d$ and $\langle X \mid \emptyset \mid W \rangle_d$ both hold, though $\langle X \mid Y \mid W \rangle_d$ does not, as $X$ and $W$ are conditionally dependent given the common child, $Y$.

Analogously, local causally conditioned causal independences in a directed information graph can be determined through a graphical separation criterion. Let "cI-map" stand for "causal I-map" and "c-separate" stand for "causally separate."

Definition 4.3: For a directed information graph for a set of random processes $\underline{X}$, let $U$, $W$, and $Z$ be three disjoint subsets of processes of $\underline{X}$. $Z$ c-separates $U$ from $W$, denoted as $\langle U \,\|\, Z \,\|\, W \rangle$, if, for every path between a node in $U$ and a node in $W$, there is a node in $Z \cup W$ with an outgoing arrow.

C-separation $\langle X \,\|\, Z \,\|\, Y \rangle$ represents that $Z$ blocks influence from $X$ to $Y$. Specifically, for all times $j$, $Y_j$ is independent of $X^{j-1}$ given $\{Y^{j-1}, Z^{j-1}\}$. In terms of path separation, this criterion is similar to d-separation. However, unlike separation criteria for Markov or Bayesian networks, c-separation is not symmetric:

$$\langle X \,\|\, Z \,\|\, Y \rangle \;\not\Rightarrow\; \langle Y \,\|\, Z \,\|\, X \rangle.$$



Example 4: Consider a system of five processes $\{A, \ldots, E\}$ with a directed information graph depicted in Figure 8. Examples of c-separation include

• $\langle D \,\|\, A \,\|\, B \rangle$, similar to d-separation.

• $\langle D \,\|\, \{A, C\} \,\|\, B \rangle$, unlike d-separation, as $C$ is a common child.

• $\langle C \,\|\, D \,\|\, A \rangle$; unlike d-separation, there is no conditioning on $B$.

• $\langle E \,\|\, \emptyset \,\|\, C \rangle$, unlike d-separation.

(a) The full variable dependence structure analogous to a Bayesian network by expressing (10). (b) The full variable dependence structure analogous to a Bayesian network after removing unnecessary processes to uncover the minimal generative model graph (12). (c) A Bayesian network (24) after removing all unnecessary variables.

Fig. 7. The variable dependence structure for Example 2 with no reduction (10), removing unnecessary process dependences (12), and with full reduction (24). Only dependences between variables at time $j - 3$ through $j$ are depicted. Figure (c) is part of the Bayesian network.

Fig. 8. Directed information graph of the system in Example 4.

Remark 5: In addition to asymmetry, another important difference between d-separation and c-separation is dependence on common children. For d-separation, conditioning on common children can render the parents dependent. Bayesian networks depend on the variable indexing. A common child for one index ordering could be a common parent for the reverse index ordering. With directed information graphs, however, causally conditioning on common children (which are not also common ancestors) will not render their parents causally dependent.

We can now define cI-maps for directed graphs.

Definition 4.4: A directed graph for a set of processes $\underline{X}$ is called a cI-map if, for all disjoint subsets $U$, $W$, and $Z$ of $\underline{X}$, each c-separation corresponds to a causal Markov chain,

$$\langle U \,\|\, Z \,\|\, W \rangle \implies U \to Z \to W.$$

A directed graph is a minimal cI-map if removing any of its edges renders it not a valid cI-map.

Theorem 4.5: For any distribution $P_{\underline{X}}$ satisfying Assumption 1, the directed information graph is a minimal cI-map.

For the proof, see Appendix E. Since directed information graphs are minimal cI-maps, a natural question is whether they are also perfect maps. That is, in addition to the global causally conditioned independences, which are all depicted, are all local ones also depicted? The following example using the noisy XOR function shows this is not necessarily the case. Causally conditioning on other processes can induce either causal dependence or causal independence.

Example 5: Let $W$, $X$, $Y$, and $Z$ be four processes, with $W$ and $X$ independent Bernoulli($\tfrac{1}{2}$) processes and

for some i.i.d. Gaussian noises $\{\epsilon_i, \epsilon'_i\}_{i=1}^{n}$. Because of the properties of the XOR function $\oplus$,

$$\begin{aligned}
I(X \to Z) &= 0 \quad \text{but} \quad I(X \to Z \,\|\, W, Y) > 0, \\
I(Y \to Z) &> 0 \quad \text{but} \quad I(Y \to Z \,\|\, W, X) = 0.
\end{aligned}$$

We have introduced directed information graphs and mini- 
mum generative model graphs as two equivalent procedures 
for identifying the causal influence structure in a network 
of processes. To identify the influences, both procedures 
calculate divergences which use the full joint statistics. We 
now discuss a procedure for identifying the structure using 
minimal-dimension statistics when there is some knowledge 
about the structure. 

V. Identifying the Structure 

In this section, we discuss methods for identifying the minimal generative model graph or, equivalently, the directed information graph. The methods will take as inputs causally conditioned directed information values. Efficiency will correspond to the dimension of the statistics that will be necessary, such as needing pairwise statistics as compared to the full joint distribution. In particular, we will show that although the definitions of minimum generative models and directed information graphs require the full joint statistics, when the number of causal parents of a process $Y$ in the graph is bounded above by $K$, there is an algorithm to identify $Y$'s parents using only $(K + 1)$-wise statistics. This algorithm was introduced in preliminary work in [62]. We first examine the algorithms for constructing minimum generative model graphs and directed information graphs.

A. General Structures 

Identifying the minimum generative model graph by Definition 3.5 involves determining, for each process $X_i$, the minimal cardinality set $A(i)$ that satisfies (16). No search order is prespecified. Since the goal is to find the smallest $A(i)$, one approach is to test increasing sizes of subsets of potential parents. For instance, first the empty set is tested, then individual processes, then pairs of processes, etc. This would require calculating an exponential number of causally conditioned directed informations (22).
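A quick count shows why the increasing-size search scales poorly. The sketch below counts candidate parent sets for a single process; it counts sets rather than individual directed information evaluations, and the helper names are hypothetical.

```python
from math import comb


def exhaustive_subsets(m):
    """Candidate parent sets for one process: every subset of the other m - 1."""
    return 2 ** (m - 1)


def subsets_up_to_size(m, size):
    """Candidate sets examined when testing sizes 0, 1, ..., size in order."""
    return sum(comb(m - 1, s) for s in range(size + 1))


m = 6
all_sets = exhaustive_subsets(m)        # every subset of the 5 other processes
small_sets = subsets_up_to_size(m, 2)   # sets of size 0, 1, and 2 only
large_m_sets = exhaustive_subsets(21)   # the count for a 21-process network
```

For six processes the worst case is 32 candidate sets per process, but the count doubles with each additional process and already exceeds one million for a 21-process network.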

An alternative method is motivated by causal Markov chains and Lemma 4.1. To find process $Y$'s parents, start with the subset of all other processes as a trivial causal Markov blanket and sequentially test each process, shrinking it down into the causal Markov boundary. This method is formally described in Algorithm 1.

Let $\mathcal{DI}_{MGM}$ denote the set of all causally conditioned directed information values from one process to another, causally conditioned on a subset of the rest:

$$\mathcal{DI}_{MGM} = \left\{ I(X_k \to X_i \,\|\, \underline{X}_{B(i)}) : k, i \in [m],\ B(i) \subseteq [m]\setminus\{i, k\} \right\}.$$

Algorithm 1. MGMconstruct

Input: $\mathcal{DI}_{MGM}$

1. For $i \in [m]$
2.     $A(i) \leftarrow [m]\setminus\{i\}$
3.     For $k \in A(i)$
4.         $B(i) \leftarrow A(i)\setminus\{k\}$
5.         If $I(X_k \to X_i \,\|\, \underline{X}_{B(i)}) = 0$
6.             $A(i) \leftarrow B(i)$

Lemma 5.1: Let $P_{\underline{X}}$ be a distribution satisfying Assumption 1. Algorithm 1 recovers the minimal generative model graph.

The proof follows from Lemma 4.1. Algorithm 1 requires the full joint statistics. However, it only uses $O(m^2)$ tests. Note that the tests used in line 5 are adaptive, using the current $B(i)$. We next consider the algorithm for constructing directed information graphs.

Directed information graphs are identified by testing each edge separately. Testing an edge entails computing the causally conditioned directed information from one process to another, causally conditioned on all other processes. This is described in Algorithm 2. Let $\mathcal{DI}_{DI}$ denote that set of causally conditioned directed informations:

$$\mathcal{DI}_{DI} = \left\{ I(X_k \to X_i \,\|\, \underline{X}_{[m]\setminus\{i, k\}}) : i, k \in [m] \right\}.$$

Algorithm 2. DIconstruct

Input: $\mathcal{DI}_{DI}$

1. For $i \in [m]$
2.     $A(i) \leftarrow \emptyset$
3. For $i, k \in [m]$
4.     If $I(X_k \to X_i \,\|\, \underline{X}_{[m]\setminus\{i, k\}}) > 0$
5.         $A(i) \leftarrow A(i) \cup \{k\}$

Lemma 5.2: Let $P_{\underline{X}}$ be a distribution satisfying Assumption 1. Algorithm 2 recovers the directed information graph.

The proof follows from Definition 3.7. Unlike Algorithm 1, Algorithm 2 uses each of the $O(m^2)$ elements in $\mathcal{DI}_{DI}$ regardless of the graph structure. Line 4 could be executed in parallel for every possible edge. The number of causally conditioned directed information tests is the same as in Algorithm 1, though the tests themselves are different. Note that $\mathcal{DI}_{DI} \subseteq \mathcal{DI}_{MGM}$.

Fig. 9. A graph depicting the task of identifying the parent(s) of $Y$ in the directed information graph in Example 6. It is known that $Y$ has at most two parents. The structure of $\{X_1, \ldots, X_5\}$ is not depicted.
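Algorithms 1 and 2 can be sketched as follows. This is an illustrative rendering under stated assumptions, not the authors' implementation: processes are indexed from 0, the directed information values are supplied by a caller-provided function `di(k, i, cond)` standing in for lookups in the input sets, and `toy_di` is a hypothetical oracle built from a known parent structure. The toy oracle is only meaningful for the conditioning sets these two procedures actually query, where the conditioning set contains every true parent other than the one being tested.

```python
def mgm_construct(m, di):
    """Algorithm 1 sketch: shrink a causal Markov blanket to the boundary.

    di(k, i, cond) returns I(X_k -> X_i || X_cond) for a frozenset cond.
    """
    parents = {}
    for i in range(m):
        a = set(range(m)) - {i}
        for k in sorted(a):  # iterate over a snapshot of the initial set
            b = a - {k}
            if di(k, i, frozenset(b)) == 0:
                a = b  # k is not needed once the rest of b is known
        parents[i] = a
    return parents


def di_construct(m, di):
    """Algorithm 2 sketch: test every ordered pair given all other processes."""
    parents = {i: set() for i in range(m)}
    for i in range(m):
        for k in range(m):
            if k != i and di(k, i, frozenset(set(range(m)) - {i, k})) > 0:
                parents[i].add(k)
    return parents


# Toy oracle (an assumption for illustration): positive exactly for true parents.
true_parents = {0: {1}, 1: {0}, 2: {1}}


def toy_di(k, i, cond):
    return 1.0 if k in true_parents[i] else 0.0


from_mgm = mgm_construct(3, toy_di)
from_di = di_construct(3, toy_di)
```

With estimated rather than exact values, the comparisons with zero would have to be replaced by a threshold or a significance test, and the adaptive tests in `mgm_construct` would then be sensitive to the order in which candidates are examined.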



B. Bounded In-degree 

Algorithms 1 and 2 both require inputs computed using the whole joint distribution of the system. Example 5 showed that using the full joint statistics is necessary in general, since if a process has every other process as a parent (for some $i$, $A(i) = [m]\setminus\{i\}$), the full joint statistics are required to correctly identify this. However, we will show that when there are upper bounds on the in-degrees of the processes, lower-dimension statistics can be used. First consider a simple example.

Example 6: Let $\underline{X} = \{X_1, \ldots, X_5, Y\}$ be a set of six processes with a known joint distribution $P_{\underline{X}}$ satisfying Assumption 1, and suppose $Y$ has at most two parents in the minimal generative model structure. Our goal is to identify the parent(s) of $Y$. This task is depicted in Figure 9, as any of the other processes could potentially be a parent.

To determine $Y$'s parent(s), first we will use Algorithm 1. To test if $X_3$ is a parent, calculate $I(X_3 \to Y \,\|\, \underline{X}_{\{1,2,4,5\}})$. If this value is 0, then to test next if $X_5$ is a parent, calculate $I(X_5 \to Y \,\|\, \underline{X}_{\{1,2,4\}})$. Otherwise use $I(X_5 \to Y \,\|\, \underline{X}_{\{1,2,3,4\}})$. This process continues until $X_1$, $X_2$, and $X_4$ are also tested.

Alternatively, Algorithm 2, which follows the definition of directed information graphs, could be used. This allows us to find each parent separately. To test if $X_3$ is a parent, calculate $I(X_3 \to Y \,\|\, \underline{X}_{\{1,2,4,5\}})$. To test if $X_5$ is a parent, calculate $I(X_5 \to Y \,\|\, \underline{X}_{\{1,2,3,4\}})$. These tests can be done in parallel.

Both of these methods require the full joint distribution. 
Given the knowledge that a node has at most $K$ parents in 
the graph, a natural question is whether the full joint distribution is 
necessary. From Example 5, it is known that at least $(K+1)$-wise 
statistics are necessary. We now describe a method 
that identifies parents in a distributed manner using only 
$(K+1)$-wise statistics. 

For each process $X_i$, only its causal parents carry the 
influence, so among all sets of $K(i)$ other processes, all 
and only those that contain $X_i$'s parents will have maximal 
directed information to $X_i$. We can then take the intersection 
of these sets (causal Markov blankets) to get precisely $X_i$'s 
causal parents (the causal Markov boundary). Algorithm 3 
formally describes this method. Let $\mathcal{DI}_{BndInd}$ denote a set 
of directed information values such that, for each $i \in [m]$, 
$\mathcal{DI}_{BndInd}$ contains the directed information values from each 
$K(i)$-sized subset of processes to $X_i$. 



$$\mathcal{DI}_{BndInd} = \big\{ I(X_{B(i)} \to X_i) : i \in [m],\ B(i) \subseteq [m] \setminus \{i\},\ |B(i)| = K(i) \big\}.$$



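Algorithm 3. StructureRecovery 

Input: $\mathcal{DI}_{BndInd}$

1. For $i \in [m]$
2.     $A(i) \leftarrow \emptyset$
3.     $K \leftarrow K(i)$
4.     $\mathcal{B} \leftarrow \{B : B \subseteq [m] \setminus \{i\},\ |B| = K\}$
5.     For $B \in \mathcal{B}$
6.         Compute $I(X_B \to X_i)$
7.     $\mathcal{B}_{max} \leftarrow \arg\max_{B \in \mathcal{B}} I(X_B \to X_i)$
8.     $A(i) \leftarrow \bigcap_{B \in \mathcal{B}_{max}} B$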
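The intersection step of Algorithm 3 can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the function name `bounded_indegree_parents`, the keying of `di` by `(frozenset, index)` pairs, the zero-based indices, the floating-point tolerance `tol`, and the numerical values are all hypothetical. The values in `di` are assumed to be exact directed information values from each K-sized candidate set to the target process.

```python
from itertools import combinations


def bounded_indegree_parents(i, m, K, di, tol=1e-12):
    """Intersect all size-K candidate sets whose directed information to process i is maximal.

    di[(B, i)] is assumed to hold the directed information from the processes
    indexed by the frozenset B to process i.
    """
    others = [k for k in range(m) if k != i]
    candidates = [frozenset(c) for c in combinations(others, K)]
    best = max(di[(B, i)] for B in candidates)
    # Candidate sets attaining the maximum (up to tol) each contain every parent.
    maximal = [B for B in candidates if best - di[(B, i)] <= tol]
    return set(frozenset.intersection(*maximal))


# Hypothetical example: four processes, process 3 has the single parent 0, and K = 2.
di = {
    (frozenset({0, 1}), 3): 0.5,
    (frozenset({0, 2}), 3): 0.5,
    (frozenset({1, 2}), 3): 0.1,
}
parents_of_3 = bounded_indegree_parents(3, m=4, K=2, di=di)
```

The search for the parents of one process never touches the values for another process, which is what makes the procedure distributed.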



Theorem 5.3: Algorithm 3 recovers the minimal generative 
model structure for a given $P_X$ if for each $i \in [m]$, $K(i) \le m - 2$. 

The proof follows from Lemma 4.1. Note $m - 1$ is a trivial 
upper bound, as that is the size of the set of all other processes, 
so there is only one candidate set. 

Algorithm 3 finds the structure using only statistics of the 
dimension of the bound on the in-degree. Thus statistics of the 
dimension given by the in-degree bound are both necessary and sufficient 
for recovering the structure. Algorithm 3 uses all of the elements 
in $\mathcal{DI}_{BndInd}$, which are $\sum_{i=1}^{m} \binom{m-1}{K(i)}$ values. Note that if 
the upper bounds $\{K(i)\}_{i=1}^{m}$ do not grow with $m$, then 
the algorithm performs $O(m^{K+1})$ directed information tests, 
where $K = \max_{i \in [m]} K(i)$. While more tests are used than in 
Algorithms 1 and 2, since only $(K+1)$-wise statistics are used, 
the time to compute or estimate the causally conditioned 
directed information values for $\mathcal{DI}_{BndInd}$ could be significantly 
less than that for $\mathcal{DI}_{MGM}$ or $\mathcal{DI}_{DI}$. 

Remark 6: We note that this is not the only algorithm 
that will recover the structure of trees ($K = 1$ for all 
nodes) using only pairwise statistics. There are methods [18], 
[19], analogous to the Chow and Liu algorithm for variables 
[17], which identify the best tree approximation for causally 
conditional dependence structures. They compute the directed 
information between all pairs of processes. However, they find 
the maximum weight directed spanning tree to determine the 
best approximating structure, thus coupling the identification of 
parents for different processes. A comparison of the properties 
of these tree approximation algorithms and Algorithms 1, 2, 
and 3 (with in-degree bound $K = 1$) is shown in Table 1. 





                    Distributed Search    Pairwise Statistics
Chow and Liu                                       X
Alg. 1 and 2                X
Alg. 3 (K = 1)              X                      X

Table 1. A comparison of properties of Chow and Liu based algorithms 
[18], [19] and Algorithms 1, 2, and 3 ($K = 1$). Distributed search means 
that the algorithm finds the parents of a process independently of finding the 
parents of other processes. 



We next simulate a network of stochastic processes and use 
this procedure to efficiently find the causal parents of each 
process. 

VI. Simulation 

In this section, we illustrate directed information graphs and 
the efficient inference method, Algorithm 3, with a simulation. 
A network of stochastic processes is simulated using a 
generative model. The directed information graph is then inferred 
using Algorithm 3 for bounded in-degree. For comparison, the 
definition of directed information graphs, Algorithm 2, is also 
used for inference. 

A. Setup 

1) Simulation design: A stochastic network of six coupled 
point processes $\{A, \ldots, F\}$ was simulated. We discretized 
time, using bins of length $\Delta = 1$ ms. The point process model 
thus becomes 

$$\log P(Y_j = y_j \mid Y^{j-1}, X^{j-1}) = y_j \log(\lambda_j \Delta) - \lambda_j \Delta \qquad (25)$$



where $y_j \in \{0,1\}$ and $\lambda_j$ is $\sigma(Y^{j-1}, X^{j-1})$-measurable, 
termed the conditional intensity function [63]. We used a 
generalized linear model with a Poisson link function. This 
class of probability models has a simple analytic form and 
has been shown to model point processes well, such as the 
communication between neurons [63]. An example of this 
distribution for a process $Y$ at time $j$ causally depending on 
itself and a process $X$ for the past $L$ time-steps is characterized 
by: 

$$\log \lambda_j = \alpha_0 + \sum_{l=1}^{L} \left( \alpha_l y_{j-l} + \beta_l x_{j-l} \right)$$

for constants $\alpha_0$ and $\{\alpha_l, \beta_l\}_{l=1}^{L}$. In the simulation, each 
process at time $j$ explicitly depended on its own past and the 
past of its parents for six time-steps $\{j-1, \ldots, j-6\}$. The 
parameters were generated randomly with a normal 
distribution and fixed for the simulation. 
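A log-linear intensity of this form can be simulated in discrete time with a few lines of Python. The sketch below is only an illustration under several assumptions that are not taken from the paper: the spike probability in each bin is taken to be the intensity times the bin length, clipped to one (a Bernoulli approximation), the bin length is folded into the constant `alpha0`, the driving process X is drawn as i.i.d. Bernoulli with a hypothetical rate `x_rate`, and all parameter values are made up for the example.

```python
import math
import random


def simulate_glm_pair(n_steps, alpha0, alpha, beta, x_rate=0.05, seed=0):
    """Simulate binary sequences x and y, where y depends on the last L bins of x and y.

    alpha[l-1] and beta[l-1] weight y[j-l] and x[j-l] in the log intensity of y[j].
    """
    rng = random.Random(seed)
    L = len(alpha)
    # Assumption: x is an i.i.d. Bernoulli sequence with no dependence on y.
    x = [1 if rng.random() < x_rate else 0 for _ in range(n_steps)]
    y = [0] * n_steps
    for j in range(n_steps):
        log_lam = alpha0
        for l in range(1, L + 1):
            if j - l >= 0:
                log_lam += alpha[l - 1] * y[j - l] + beta[l - 1] * x[j - l]
        # Per-bin spike probability, with the bin length absorbed into alpha0.
        p_spike = min(1.0, math.exp(log_lam))
        y[j] = 1 if rng.random() < p_spike else 0
    return x, y


# Hypothetical parameters: self-inhibition in y and excitation from x over six bins.
x, y = simulate_glm_pair(1000, alpha0=-3.0, alpha=[-1.0] * 6, beta=[0.5] * 6)
```

Fixing the seed makes the run reproducible, which mirrors the statement that the parameters were drawn once and then held fixed.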

Figure 10(a) depicts the causal dependencies in the 
generative model used to simulate the network. By the 
discrete-time construction of the simulation, Assumption 1 holds, so the 
minimum generative model graph and the directed information 
graph are the same. The network was simulated for ten minutes 
at one-millisecond resolution ($6 \times 10^5$ time-steps). Although 
the generative model was known by design, only the simulated 
data were used to infer the causal influence structure. 
(a) The minimum generative model used to simulate the network. 

(b) The graph structure inferred using Algorithm 3, assuming an in-degree bound of $K = 3$. Normalized directed information rate estimates were computed for each $K$-sized subset. All normalized directed information rate values that were within 5% of the maximum value were considered maximal. Edges $B \to A$ and $C \to A$ are missing. 

(c) The graph structure inferred using Algorithm 2. An edge was accepted if its normalized, causally conditioned directed information rate estimate was greater than 5%. It incorrectly identified $D \to B$ as an edge. 

Fig. 10. The influence structure of the minimum generative model used to simulate the network and the graphs inferred using Algorithms 2 and 3. Both correctly identified the presence or absence of nearly all of the 30 potential edges, with only one and two mistakes, respectively. 



2) Directed information estimation: Multiple methods have 
been proposed to consistently estimate the 
directed information under appropriate assumptions. One is a 
parametric technique that first does model fitting and then 
takes the empirical average of log likelihoods [34], [35], [37], 
and is applicable to point process modalities. The other is 
a universal estimation technique using the context tree 
weighting method [41] that is applicable to discrete-time processes 
with finite alphabets. We employed the parametric technique 
in [34]. We now sketch the procedure. 

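The parametric technique estimates the causally conditioned 
directed information rate [13] 

$$\bar{I}(X_{\mathcal{I}_1} \to Y \| X_{\mathcal{I}_2}) \triangleq \lim_{n \to \infty} \frac{1}{n} I(X_{\mathcal{I}_1} \to Y \| X_{\mathcal{I}_2})$$

for given subsets $\mathcal{I}_1, \mathcal{I}_2$ of process indices. It does so by first 
estimating two causally conditioned entropy rates, 

$$\bar{H}(Y \| X_{\mathcal{I}_2}) \triangleq \lim_{n \to \infty} \frac{1}{n} E_{P_X}\left[ -\log P_{Y \| X_{\mathcal{I}_2}}(Y \| X_{\mathcal{I}_2}) \right]$$

and $\bar{H}(Y \| X_{\mathcal{I}_2 \cup \mathcal{I}_1})$, defined likewise. 

Then the causally conditioned directed information rate is 
their difference: 

$$\bar{I}(X_{\mathcal{I}_1} \to Y \| X_{\mathcal{I}_2}) = \bar{H}(Y \| X_{\mathcal{I}_2}) - \bar{H}(Y \| X_{\mathcal{I}_2 \cup \mathcal{I}_1}).$$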

The technique assumes $P_{Y, X_{\mathcal{I}_2 \cup \mathcal{I}_1}}$ 

• is jointly stationary and ergodic; 

• is Markov of some finite order $J^*$, with $J^*$ unknown but 
bounded; 

• belongs to a known parametric class of distributions. 

These assumptions imply that $P_{Y_j \mid Y^{j-1}, X^{j-1}_{\mathcal{I}_2 \cup \mathcal{I}_1}}$ is characterized by 
a constant, finite-dimensional parameter vector $\theta$ for all time $j$. 

The procedure estimates causally conditioned entropies as 
follows [34]: 

1) do maximum likelihood model fits to estimate $\theta(J)$ for 
various model orders $J$; 

2) use a model order selection criterion [64] to determine 
the best-fitting model order $J^*$; 

3) compute the empirical average of the negative log likelihood of 
$P_{Y_j \mid Y^{j-1}, X^{j-1}_{\mathcal{I}_2 \cup \mathcal{I}_1}}$ characterized by $\hat{\theta}(J^*)$. This average 
is the estimate $\hat{H}(Y \| X_{\mathcal{I}_2 \cup \mathcal{I}_1})$ of the causally 
conditional entropy $\bar{H}(Y \| X_{\mathcal{I}_2 \cup \mathcal{I}_1})$. 

Instead of using the causally conditioned directed 
information rate estimates directly, normalized values were used: 

$$\bar{I}_N(X_{\mathcal{I}_1} \to Y \| X_{\mathcal{I}_2}) \triangleq \frac{\hat{H}(Y \| X_{\mathcal{I}_2}) - \hat{H}(Y \| X_{\mathcal{I}_2 \cup \mathcal{I}_1})}{\hat{H}(Y \| X_{\mathcal{I}_2})}.$$

In implementing this procedure to analyze the simulated 
data, it was assumed that $P_{Y_j \mid Y^{j-1}, X^{j-1}}$ belonged to the 
family of generalized linear models with Poisson link function 
(25). The model fits were computed using the Matlab function 
glmfit(·) for a fixed model order $J = 10$. 
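The last two steps of the estimation procedure, averaging the negative log likelihood and normalizing the difference of two entropy estimates, can be sketched as follows. This is a simplified illustration, not the procedure of [34]: it assumes binary observations scored with a Bernoulli likelihood, it takes the fitted one-step spike probabilities as given rather than fitting a generalized linear model, and the function names and all numbers are hypothetical.

```python
import math


def entropy_rate_estimate(y, probs):
    """Average negative log likelihood (nats per time step) of binary outcomes y.

    probs[j] is the fitted probability that y[j] == 1 given the conditioning past.
    """
    total = 0.0
    for yj, pj in zip(y, probs):
        total -= math.log(pj if yj == 1 else 1.0 - pj)
    return total / len(y)


def normalized_di(h_without, h_with):
    """Normalized directed information: relative reduction in the entropy estimate."""
    return (h_without - h_with) / h_without


# Hypothetical fitted probabilities for one short sequence.
y = [1, 0, 1, 0]
h_without = entropy_rate_estimate(y, [0.5, 0.5, 0.5, 0.5])  # model ignoring the candidate set
h_with = entropy_rate_estimate(y, [0.8, 0.2, 0.8, 0.2])     # model including the candidate set
score = normalized_di(h_without, h_with)
```

A score near zero indicates that including the candidate processes barely improves prediction, while a score near one indicates that they remove almost all remaining uncertainty.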

3) Graph structure identification: First, the structure was 
inferred using Algorithm 2, which follows from the definition 
of directed information graphs. Each of the 30 possible edges 
was tested separately. To test the edge $A \to B$, for instance, 
$I(A \to B \| \{C,D,E,F\})$ was computed. Other edges were 
tested likewise. An edge was accepted if its normalized, 
causally conditioned directed information rate estimate was 
greater than 5%. 

Additionally, Algorithm 3 was used under an assumed upper 
bound of $K = 3$. This value is a tight upper bound for the 
network, though only one process had three parents. Thus, 
for each process, $\binom{5}{3} = 10$ sets of candidate parents were 
considered. To find the parents of process $A$, for example, 
the normalized directed information rate estimates for each 
$K$-sized subset of $\{B,C,D,E,F\}$ to $A$ were computed. The 
maximum normalized directed information rate was identified. 
Then all other normalized directed information rate values that 
were within 5% of that maximum value were considered 
maximal. The intersection of the corresponding maximal subsets 
was taken to identify the parents. 
B. Results 

The structure of the simulated network inferred using 
Algorithm 3 is shown in Figure 10(b). This is the same structure 
as the generative model, except that it is missing $B \to A$ and 
$C \to A$. For this simulation, most of the parent sets had 
distinguishably strong influence. For instance, for inferring 
the parents of process $C$, the sets $\{A,D,F\}$, $\{B,D,F\}$, 
and $\{E,D,F\}$ all had normalized directed information rate 
values of 62%, with the next highest value of 51% from 
set $\{A,D,E\}$. However, for inferring the parents of $A$, 
$\{B,C,E\}$ had an estimated value of 57%, while $\{B,D,E\}$ 
and $\{C,E,F\}$ had values of 52%. 
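The effect of the 5% rule on these reported values can be traced with a short sketch. It rests on assumptions that should be read as such: only the subsets whose values are quoted above are included (the remaining candidate subsets are not reported), "within 5%" is interpreted as an absolute difference of at most five percentage points, and the function name `parents_from_near_maximal` is hypothetical.

```python
def parents_from_near_maximal(scores, tol):
    """Intersect all candidate sets whose score is within tol of the maximum score."""
    best = max(scores.values())
    maximal = [set(B) for B, s in scores.items() if best - s <= tol]
    return set.intersection(*maximal)


# Normalized values in percent, as quoted in the text; unreported subsets are omitted.
scores_C = {("A", "D", "F"): 62, ("B", "D", "F"): 62, ("E", "D", "F"): 62, ("A", "D", "E"): 51}
scores_A = {("B", "C", "E"): 57, ("B", "D", "E"): 52, ("C", "E", "F"): 52}

parents_C = parents_from_near_maximal(scores_C, tol=5)
parents_A = parents_from_near_maximal(scores_A, tol=5)
```

Under this reading, the three near-maximal sets for A share only E, which is consistent with the edges from B and C into A going undetected, while a tighter tolerance would have retained the single set with the highest value.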

The structure inferred using Algorithm 2, the definition of 
directed information graphs, is shown in Figure 10(c). It is 
almost correct. However, this algorithm additionally detected 
the edge $D \to B$. The normalized directed information of that 
edge was close to the threshold: the normalized estimate of 
$I(D \to B \| \{A,C,E,F\})$ was 6%. 
All of the other inferred edges corresponded to higher 
normalized values, from 11% to 51%. While this method computed 
fewer directed information estimates (30 as opposed to 60), it 
required estimating full joint statistics. 



VII. Conclusion and Future Directions 

Methods that characterize the causal dependence structure 
of a network of dynamically interacting processes could 
significantly bolster research in a number of diverse disciplines, 
including social sciences, economics, biology, and physics. 
We propose two graphical models that represent the causal 
dependence structure of such networks. One is motivated by 
generative models and the other by Granger causality. Despite 
their different perspectives, the structures they depict are 
equivalent. The presence or absence of an edge is determined 
by calculating an information theoretic divergence known as 
directed information. 

There are a number of directions for future research to 
facilitate applying these methods to real-world data. One direction 
is improving estimation techniques of causally conditioned 
directed information. As discussed in this paper, there are 
already several estimation techniques [34], [35], [37], [41]. 
Computational feasibility of current methods is an aspect 
that will need to be further explored, especially for 
data-rich applications. For instance, social networks could involve 
millions of users over the course of several years. It will 
be important to improve current estimation techniques and 
develop new ones to efficiently identify the causal dependence 
structure even of such large datasets. 

Another direction of future research involves extending 
these graphical models to time-varying network structures. The 
graphical models proposed here model dynamically interacting 
processes, but assume the causal dependence structure itself 
is time-invariant. For a number of real-world networks, the 
network influence structure varies with time, such as the 
internet, the brain, and social networks like Twitter. Incorporating 
these future research extensions could provide for a practical 
framework to analyze a variety of real-world networks of 
causally interacting processes. 

Fig. 11. Sets of processes for Lemma A.1. 

Appendix A 
Proof of Lemma 3.4 

For the proof that minimum generative models are unique, 
namely that the $A(i)$'s can be selected independently and there 
is a unique one of minimum size, we use the following lemma. 
It is a variation of the intersection property which is used in the 
proof of uniqueness for Bayesian networks for a given ordering 
([42] pp. 119). Essentially, the lemma says that no two subsets 
of processes can influence a process $Y$ in exactly the same 
way, unless all the influence comes from their intersection. 

Lemma A.1: Let $P_X$ denote a joint distribution satisfying 
Assumption 1. Let $Y \in X$, and let $U \subseteq X \setminus \{Y\}$ and 
$W \subseteq X \setminus \{Y\}$ be two sets of processes. Denote the intersection as 
$V = U \cap W$. (See Figure 11.) If 

$$D(P_{Y \| U \cup W} \,\|\, P_{Y \| U} \mid P_{U \cup W}) = 0 \qquad (26)$$

and 

$$D(P_{Y \| U \cup W} \,\|\, P_{Y \| W} \mid P_{U \cup W}) = 0, \qquad (27)$$

then 

$$D(P_{Y \| U \cup W} \,\|\, P_{Y \| V} \mid P_{U \cup W}) = 0. \qquad (28)$$



Proof: Since $P_X$ is positive, the divergences (26), (27), 
and (28) are equivalent to conditional independence 
statements of the underlying random variables. For instance, using 
marginalization and non-negativity of KL divergence, (26) is 
equivalent to 

$$Y_j \perp W^{j-1} \mid U^{j-1}, Y^{j-1}$$

for all time $1 \le j \le n$. The proof then directly follows from 
the intersection property ([42] pp. 84). ■ 



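We can now prove Lemma 3.4. 

Proof: Suppose not. Let $A$ and $B$ be two distinct minimal 
generative models for $P_X$. Let $Y = X_i$ for some $i \in [m]$ 
be any process for which $A(i) \neq B(i)$. By definition of 
minimal generative models, properties of the logarithm, and 
non-negativity of KL-divergence, 

$$D(P_{X_i \| X_{[m] \setminus \{i\}}} \,\|\, P_{X_i \| X_{A(i)}} \mid P_{X_{[m] \setminus \{i\}}}) = 0$$

and 

$$D(P_{X_i \| X_{[m] \setminus \{i\}}} \,\|\, P_{X_i \| X_{B(i)}} \mid P_{X_{[m] \setminus \{i\}}}) = 0.$$

Thus, by Lemma A.1, 

$$D(P_{X_i \| X_{[m] \setminus \{i\}}} \,\|\, P_{X_i \| X_{A(i) \cap B(i)}} \mid P_{X_{[m] \setminus \{i\}}}) = 0.$$

This is a contradiction, as $|A(i) \cap B(i)| < |A(i)|$ but $|A(i)|$ is 
minimal by definition. ■ 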
Appendix B 
Proof of Theorem 3.8 

Proof: Let $\{A(i)\}_{i=1}^{m}$ denote the parent sets in the 
minimal generative model, and let $X = X_k$ and $Y = X_i$ be 
processes for some $i \neq k \in [m]$. We consider two cases. First 
suppose that there is no edge from $X$ to $Y$ in the minimal 
generative model graph. By definition, $k \notin A(i)$. Thus, 

$$D(P_{Y \| X_{[m] \setminus \{i\}}} \,\|\, P_{Y \| X_{[m] \setminus \{i,k\}}} \mid P_{X_{[m] \setminus \{i\}}}) = 0, \qquad (29)$$

and consequently 

$$I(X \to Y \| X_{[m] \setminus \{i,k\}}) = 0. \qquad (30)$$

This means that there is no edge from $X$ to $Y$ in the directed 
information graph either. 

Now suppose that there is no edge from $X$ to $Y$ in the 
directed information graph. Equation (30) holds, which implies 
(29). However, suppose there is a directed edge from $X$ to $Y$ 
in the minimal generative model graph. Thus $k \in A(i)$, and 
$|A(i)|$ is minimal. With (29), this implies by Lemma A.1 that 

$$D(P_{Y \| X_{[m] \setminus \{i\}}} \,\|\, P_{Y \| X_{A(i) \setminus \{k\}}} \mid P_{X_{[m] \setminus \{i\}}}) = 0.$$

This is a contradiction because $|A(i)|$ is minimal. Therefore 
there is no edge in the minimal generative model graph either. 

■ 

Appendix C 
Proof of Lemma 4.1 

The proof is based on applying the chain rule different ways, 
non-negativity of directed information, and Lemma 3.4. 

Proof: First apply the chain rule different ways: 

$$I(X_{W(i) \cup B(i)} \to X_i) = I(X_{W(i)} \to X_i) + I(X_{B(i)} \to X_i \| X_{W(i)})$$
$$\phantom{I(X_{W(i) \cup B(i)} \to X_i)} = I(X_{B(i)} \to X_i) + I(X_{W(i)} \to X_i \| X_{B(i)}). \qquad (31)$$

Note by definition of $A(i)$ in the minimum generative 
models, 

$$I(X_{W(i) \cup B(i)} \to X_i \| X_{A(i)}) = 0. \qquad (32)$$

Next apply the chain rule to $I(X_{B(i)} \to X_i)$ in (31) and 
combine with (32). Since $A(i) \subseteq B(i)$, 

$$I(X_{B(i)} \to X_i) = I(X_{A(i)} \to X_i) = I(X_{W(i) \cup B(i)} \to X_i).$$

Consequently, since directed information is a KL divergence 
and thus nonnegative, $I(X_{W(i)} \to X_i \| X_{B(i)}) = 0$ and 

$$I(X_{W(i)} \to X_i) \le I(X_{B(i)} \to X_i).$$

This is the inequality (23) in Lemma 4.1. Equality occurs 
when $I(X_{B(i)} \to X_i \| X_{W(i)}) = 0$. This corresponds to the 
divergence 

$$D(P_{X_i \| X_{W(i) \cup B(i)}} \,\|\, P_{X_i \| X_{W(i)}} \mid P_{X_{W(i) \cup B(i)}}) = 0.$$

Since we also have that 

$$D(P_{X_i \| X_{W(i) \cup B(i)}} \,\|\, P_{X_i \| X_{A(i)}} \mid P_{X_{W(i) \cup B(i)}}) = 0,$$

by Lemma 3.4, $A(i) \subseteq W(i)$, else $A(i)$ would not be minimal, 
which is a contradiction. ■ 



Appendix D 
Proof of Lemma 4.2 

First consider the procedure for constructing a Bayesian 
network, outlined in Section II-B. For each variable $X_i$, let 
$A_i \subseteq \{X_1, X_2, \ldots, X_{i-1}\}$ denote a Markov boundary of $X_i$ with 
respect to the set of preceding variables $\{X_1, X_2, \ldots, X_{i-1}\}$. 
The directed acyclic graph (DAG) formed by setting each 
member of $A_i$ as a parent of $X_i$ is called a boundary DAG 
with respect to the index order. By [65], boundary DAGs 
are Bayesian networks (minimal I-maps under d-separation). 
This allows for a relatively simple procedure (factorization and 
reduction) to identify a minimal I-map, regardless of the initial 
ordering. We now prove Lemma 4.2. 

Proof: The minimum generative model specifies an 
ordering to apply the chain rule. The ordering is first over time, 
and, with Assumption 1, the variables at the same time instant 
are conditionally independent given the past. Consequently, 
any causal ordering results in the same factorization. 

Next, some variable dependencies are removed. Specifically, 
for each process $X_i$, the conditioning on processes not in its causal Markov 
boundary is dropped from each conditional term 
$P_{X_{i,j} \mid X^{j-1}}$. There is still conditioning on the full past of all 
processes in $X_i$'s causal Markov boundary. By further 
removing all unnecessary variable dependencies for each conditional 
term, which can be done uniquely since the joint distribution 
is positive, a boundary DAG, and thus a Bayesian network, 
for the variable dependence structure is obtained. ■ 



Appendix E 
Proof of Theorem 4.5 

The proof is based on d-separation in the Bayesian network 
of variable dependencies underlying the directed information 
graph. We will use the following corollary to simplify the 
argument. 

Corollary E.1: Let $P_X$ be any distribution satisfying 
Assumption 1. Consider its directed information graph and the 
corresponding induced Bayesian network. Any path between 
processes in the Bayesian network corresponds to the same 
path (possibly reusing edges) in the directed information 
graph. 

The proof follows from Lemma 4.2. We now prove 
Theorem 4.5. 

Proof: We first show that $Z \cup W$ d-separates $U$ from the 
parents of $W$ not in $Z$. We then show this fact implies the 
result. 

Let $T$ denote the set of nodes that have a child in $W$ and 
are not in $Z$. Any path $p$ from a node in $U$ to a node in $T$ 
can be extended by one edge into a path from $U$ to $W$. Thus, 
by definition of c-separation, there must be a node in $Z \cup W$ 
that is on path $p$ with an outgoing edge. 

Thus, we have that in the directed information graph, $Z \cup W$ 
d-separates $U$ from $T$. By Corollary E.1, this corresponds to 
d-separation in the Bayesian network. For all time $j$, 

$$U^{j-1} \perp T^{j-1} \mid Z^{j-1}, W^{j-1}. \qquad (33)$$
Since $Z \cup T$ forms a causal Markov blanket for $W$, we have 
that 

$$U^{j-1} \perp W_j \mid Z^{j-1}, W^{j-1}, T^{j-1}. \qquad (34)$$

By the contraction property ([42] pg. 84), equations (33) and 
(34) imply 

$$U^{j-1} \perp \{W_j, T^{j-1}\} \mid Z^{j-1}, W^{j-1}.$$

By decomposition, 

$$U^{j-1} \perp W_j \mid Z^{j-1}, W^{j-1},$$

or, equivalently, the following causal Markov chain holds: 

$$U \to Z \to W.$$

This proves that directed information graphs are cl-maps. We 
next consider minimality. 

Consider any two processes $X_i$ and $X_k$ in $X$ with 
an edge $X_k \to X_i$. Removing this edge means 
$\langle X_k \,\|\, X_{[m] \setminus \{i,k\}} \,\|\, X_i \rangle$, which implies that 
$X_k \to X_{[m] \setminus \{i,k\}} \to X_i$ holds. By construction of the directed 
information graph, this causal independence statement is incorrect, 
leading to a contradiction. Thus, directed information graphs 
are minimal cl-maps. 

■ 